Target detection method, apparatus, and system

ABSTRACT

A target detection method and apparatus, in which the method includes: obtaining a target candidate region in a to-be-detected image; determining at least two part candidate regions from the target candidate region by using an image segmentation network, where each part candidate region corresponds to one part of a to-be-detected target; extracting, from the to-be-detected image, local image features corresponding to the part candidate regions; learning the local image features of the part candidate regions by using a bidirectional long short-term memory (LSTM) network, to obtain a part relationship feature used to describe a relationship between the part candidate regions; and detecting the to-be-detected target in the to-be-detected image based on the part relationship feature. As a result, image data processing precision in target detection can be improved, application scenarios of target detection can be diversified, and target detection accuracy can be improved.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Application No. PCT/CN2019/072015, filed on Jan. 16, 2019, which claims priority to Chinese Patent Application No. 201810094901.X, filed on Jan. 30, 2018. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

Embodiments of this application relate to the field of big data, and in particular, to a target detection method and apparatus.

BACKGROUND

In the historical background of safe city establishment, search-by-image becomes one of the important technical means to assist a public security organization (for example, the people's public security) to quickly locate a crime location and a movement track of a target (for example, a criminal suspect). Search-by-image is finding image data including a target from a massive amount of surveillance video data by using a query image including the target, and determining, based on the image data including the target, information such as a time and a location at which the target appears in a surveillance video, so as to determine a movement track of the target. Search-by-image includes two processes: target database establishment and target query. In the target database establishment process, target detection and tracking first need to be performed on a massive number of videos, and each piece of image data of the target is extracted from the massive video data, to create a target database used for search-by-image. In the target query process, an input query image including a target is compared with the image data included in the target database, to locate information such as a time and a location at which the target appears in the massive videos.

In a target detection manner in the prior art, a target candidate region is extracted from an image, then the target candidate region is divided into rectangular image blocks of a fixed size, and further, rectangular image blocks of different quantities at different locations are combined to obtain a possible part region of a target. For example, assuming that the target is a pedestrian, possible part regions of the target that are obtained through rectangular image block division and recombination include a head, a shoulder, a left body, a right body, a leg, and the like. In the prior art, the target candidate region is divided based on a rectangular image block. As a result, division precision is low, a relatively large amount of interference information is included in each rectangular image block obtained through division, and accuracy of reflecting a posture change or a blocking status of the target is low. Consequently, the target detection manner has poor applicability.

SUMMARY

Embodiments of this application provide a target detection method and apparatus, to improve image data processing precision in target detection, diversify application scenarios of target detection, and improve target detection accuracy, so that the target detection method and apparatus have higher applicability.

According to a first aspect, an embodiment of this application provides a target detection method, and the method includes the following. A target candidate region in a to-be-detected image and a global image feature corresponding to the target candidate region are obtained. Herein, the target candidate region may include a plurality of regions of a target (the target herein may be understood as a to-be-detected target in actual detection), and the plurality of regions include a region that actually includes the target, and also include a region that may include the target but actually does not include the target. The global image feature herein is an image feature corresponding to the target candidate region. The global image feature is an image feature extracted by using the target as a whole, and the image feature may also be referred to as an overall image feature. Part candidate regions respectively corresponding to at least two parts are determined from the target candidate region by using an image segmentation network, and local image features corresponding to the part candidate regions are extracted from the to-be-detected image. The local image feature herein is an image feature extracted for a local detail such as a part of the target, and one part candidate region corresponds to one group of local image features. The local image features corresponding to the part candidate regions are learned by using a bidirectional long short-term memory (LSTM) network, to obtain a part relationship feature used to describe a relationship between the part candidate regions. The to-be-detected target in the to-be-detected image is detected based on the part relationship feature.
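
To make the processing flow of the first aspect concrete, the following is a minimal sketch in PyTorch; the module names (backbone, segmentation_net, part_lstm, detector_head), all dimensions, and the pooling step are hypothetical illustrations, not components specified in this application.

```python
# Minimal sketch of the first-aspect pipeline, assuming PyTorch.
# All module names and dimensions are hypothetical placeholders.
import torch
import torch.nn as nn

class TargetDetector(nn.Module):
    def __init__(self, feat_dim=256, num_parts=10):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, 3, padding=1)        # global feature extractor (stand-in)
        self.segmentation_net = nn.Conv2d(feat_dim, num_parts, 1)   # per-pixel part scores
        self.part_lstm = nn.LSTM(feat_dim, 128, bidirectional=True, batch_first=True)
        self.detector_head = nn.Linear(2 * 128, 2)                  # target vs. background

    def forward(self, image):
        feat = self.backbone(image)                  # global image feature
        part_maps = self.segmentation_net(feat)      # part candidate regions (pixel level)
        # Pool one local feature vector per part region (simplified pooling).
        weights = part_maps.softmax(dim=1)           # (b, num_parts, h, w)
        local = torch.einsum('bphw,bchw->bpc', weights, feat)  # (b, num_parts, feat_dim)
        seq_out, _ = self.part_lstm(local)           # part relationship features
        return self.detector_head(seq_out.mean(dim=1))  # detection score

detector = TargetDetector()
scores = detector(torch.randn(1, 3, 128, 64))        # one to-be-detected image
print(scores.shape)  # torch.Size([1, 2])
```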

In some implementations, the relationship between the part candidate regions includes at least one of a relationship between the detected target and the part candidate regions, or a dependency relationship between the part candidate regions. The relationship between the detected target and the part candidate regions includes: a relationship that exists between the part candidate regions and a same detected target when the part candidate regions belong to the same detected target, and/or a relationship that exists between each of the part candidate regions and the detected target to which the part candidate region belongs when the part candidate regions belong to different detected targets.

In this embodiment of this application, prediction results that are of pixels and that correspond to parts may be divided by using the image segmentation network, to obtain a part candidate region corresponding to each part. Pixels whose prediction results belong to a same part may be grouped into a part candidate region corresponding to the part. When parts of the target are identified and divided by using the image segmentation network, a pixel-level image feature may be identified, so that part division has higher accuracy, thereby diversifying the scenarios, such as target posture change scenarios, to which the target detection method is applicable. Therefore, the target detection method has higher applicability. In this embodiment of this application, after the local image features of higher division accuracy are obtained by using the image segmentation network, the relationship between the part candidate regions is learned by using the bidirectional LSTM, so that not only an obvious location relationship between the part candidate regions can be learned, but also an implied part relationship between the part candidate regions can be analyzed, where the implied part relationship includes the following: the parts belong to a same detected pedestrian, the parts belong to different detected pedestrians, or the like. Therefore, part identifiability is increased when a posture of the to-be-detected target in the to-be-detected image is changed or the to-be-detected target in the to-be-detected image is blocked, thereby improving target detection accuracy. In this embodiment of this application, the part relationship feature obtained by learning the local image features corresponding to the part candidate regions may be used to determine whether the to-be-detected image includes the target. This is simple to operate, and target identification efficiency is high.
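
As one way to picture the grouping of pixel-level prediction results into part candidate regions, the following NumPy sketch collects the pixels predicted for each part into a tight bounding box; the label-map input and the bounding-box rule are illustrative assumptions, not a rule fixed in this application.

```python
# Sketch of grouping per-pixel part predictions into part candidate regions.
import numpy as np

def part_candidate_regions(label_map, num_parts):
    """label_map: (H, W) array of predicted part indices (0 = background)."""
    regions = {}
    for part in range(1, num_parts + 1):
        ys, xs = np.nonzero(label_map == part)
        if ys.size == 0:
            continue  # part not visible (for example, blocked)
        # Tight bounding box over all pixels predicted as this part.
        regions[part] = (xs.min(), ys.min(), xs.max(), ys.max())
    return regions

label_map = np.zeros((8, 6), dtype=int)
label_map[0:2, 2:4] = 1   # head pixels
label_map[2:6, 1:5] = 2   # trunk pixels
print(part_candidate_regions(label_map, num_parts=2))
```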

In some implementations, when detecting the to-be-detected target in the to-be-detected image based on the part relationship feature, the to-be-detected target in the to-be-detected image may be determined based on the part relationship feature with reference to the global image feature. In this embodiment of this application, the part relationship feature may be further merged with the global image feature to detect the to-be-detected target in the to-be-detected image, so as to avoid interference caused by a part division error, and improve target detection accuracy.

In some implementations, the part relationship feature may be merged with the global image feature, and a first confidence level of each of a category and a location of the to-be-detected target in the to-be-detected image is obtained through learning based on a merged feature. A second confidence level at which the target candidate region includes the to-be-detected target is determined based on the global image feature, and it is determined, based on merging of the first confidence level and the second confidence level, that the to-be-detected image includes the to-be-detected target. Further, a location of the to-be-detected target in the to-be-detected image may be determined based on a location, in the to-be-detected image, of the target candidate region including the to-be-detected target. In this embodiment of this application, the first confidence level is used to determine, at a part layer of a target, whether the to-be-detected image includes the target and a prediction result of a target location. The second confidence level is used to determine, at a layer of an entire target, whether the to-be-detected image includes the target and a prediction result of a target location. When the second confidence level is greater than or equal to a preset threshold, it may be determined that the target candidate region is a region including the target; or when the second confidence level is less than a preset threshold, it may be determined that the target candidate region is a background region that does not include the target. In this embodiment of this application, the first confidence level is merged with the second confidence level, so that a more accurate prediction result may be obtained based on a prediction result corresponding to the first confidence level and with reference to a prediction result corresponding to the second confidence level, thereby improving prediction precision of target detection. In this embodiment of this application, the global image feature, of the target candidate region, in the to-be-detected image may be merged with the part relationship feature between the part candidate regions, and the global image feature is merged with the local image feature to obtain a richer feature expression, so that a more accurate detection result may be obtained, thereby improving target detection accuracy. Therefore, the target detection method has higher applicability.
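
A minimal sketch of merging the first and second confidence levels is given below; the equal weighting and the 0.5 decision threshold are illustrative assumptions, since this application does not fix a particular fusion formula.

```python
# Illustrative fusion of part-level (first) and whole-target (second)
# confidence levels; equal weighting and the 0.5 threshold are assumptions.
def fuse_confidences(first_conf, second_conf, weight=0.5, threshold=0.5):
    merged = weight * first_conf + (1 - weight) * second_conf
    return merged, merged >= threshold  # (score, contains-target decision)

score, is_target = fuse_confidences(first_conf=0.82, second_conf=0.64)
print(score, is_target)  # 0.73 True
```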

In some implementations, when the local image features of the part candidate regions are learned by using the bidirectional LSTM, the local image features of the part candidate regions may be sorted in a preset part sequence to obtain a sorted feature sequence, and the feature sequence is input to the bidirectional LSTM. The relationship between the part candidate regions is learned by using the bidirectional LSTM and by using a binary classification problem distinguishing between a target and a background as a learning task. Herein, the binary classification problem distinguishing between a target and a background may be understood as a classification problem used to determine whether a part candidate region is a region including a target or a region that does not include a target (that is, a background is included), where the two cases are counted as two classes: a target and a background. For ease of description, the classification problem may be briefly referred to as the binary classification problem distinguishing between a target and a background. The preset part sequence may be a preset part arrangement sequence, for example, a head, a left arm, a right arm, a left hand, a right hand, an upper body, a left leg, a right leg, a left foot, and a right foot. The preset part sequence may be specifically determined according to a requirement in an actual application scenario, and is not limited herein. In this embodiment of this application, when the relationship between the part candidate regions is learned by using the bidirectional LSTM, a learning objective is set for the bidirectional LSTM network, that is, a binary classification problem used to determine whether a part candidate region is a target or a background is used as the learning objective, to obtain, through learning, the part relationship feature used to indicate the relationship between the part candidate regions. Further, the to-be-detected target may be detected by using the part relationship feature that is obtained by using the bidirectional LSTM through learning and that is used to describe the relationship between the part candidate regions. This is simple to operate, and target detection efficiency may be improved.
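
The following sketch, assuming PyTorch, shows local part features being sorted into a preset part sequence and fed to a bidirectional LSTM whose output serves as the part relationship feature; the part order shown and all dimensions are illustrative.

```python
# Sketch of sorting local part features into a preset part sequence and
# learning the part relationship feature with a bidirectional LSTM.
import torch
import torch.nn as nn

PART_ORDER = ["head", "left_arm", "right_arm", "left_hand", "right_hand",
              "upper_body", "left_leg", "right_leg", "left_foot", "right_foot"]

feat_dim, hidden = 256, 128
lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * hidden, 2)  # binary task: target vs. background

# One local feature vector per part candidate region (batch of 1).
local_features = {name: torch.randn(feat_dim) for name in PART_ORDER}
sequence = torch.stack([local_features[n] for n in PART_ORDER]).unsqueeze(0)

part_relation, _ = lstm(sequence)          # (1, num_parts, 2*hidden)
logits = classifier(part_relation.mean(dim=1))
print(part_relation.shape, logits.shape)   # relationship feature and class scores
```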

According to a second aspect, an embodiment of this application provides a target detection method, and the method includes the following. A target candidate region in a to-be-detected image and a global image feature corresponding to the target candidate region are obtained. A positive sample image feature and a negative sample image feature that are used for part identification are obtained, and a part identification model is constructed based on the positive sample image feature and the negative sample image feature. It may be understood that the part identification model herein is a network model that has a capability of obtaining a local image feature of a target part, and a specific existence form of the part identification model is not limited herein. Part candidate regions respectively corresponding to at least two parts are determined from the target candidate region by using the part identification model, and local image features corresponding to the part candidate regions are extracted from the to-be-detected image. The local image features of the part candidate regions are learned by using a bidirectional long short-term memory (LSTM) network, to obtain a part relationship feature used to describe a relationship between the part candidate regions. A to-be-detected target in the to-be-detected image is detected based on the part relationship feature.

In this embodiment of this application, the part identification model that has a part identification capability may be constructed by using the positive sample image feature and the negative sample image feature that are used for part identification, and a local image feature corresponding to each part may be extracted from the to-be-detected image by using the part identification model, so as to diversify manners of identifying a target part, diversify manners of obtaining a part candidate region and a local image feature, and diversify manners of implementing target detection. Therefore, the target detection method has higher applicability.

In some implementations, when the positive sample image feature and the negative sample image feature that are used for part identification are obtained, a candidate box template in which a target is used as a detected object may be first obtained. The candidate box template is divided into N grids, and a grid covered by a region in which each part of the target is located is determined from the N grids, where N is an integer greater than 1. Then, a positive sample image and a negative sample image used for target detection may be obtained by using the candidate box template from sample images used for target part identification. It may be understood that the candidate box template may be a pre-constructed template used for part identification function training, and the template is applicable to part identification function training of the part identification model used for target detection. Further, a sample image used for part identification may be obtained, and a plurality of candidate regions in which a target is used as a detected object are determined from the sample image. Then, a candidate region labeled with the target in the plurality of candidate regions is determined as a positive sample region of the target, and a candidate region whose intersection-over-union with the positive sample region is less than a preset proportion is determined as a negative sample region of the target. Herein, the intersection-over-union of two regions may be understood as a ratio of the area of the intersection of the two regions to the area of the union of the two regions. The positive sample region is divided into N grids, and a positive sample grid and a negative sample grid that correspond to each part are determined from the N grids of the positive sample region based on the candidate box template. The negative sample region is divided into N grids, and a grid that is in the N grids of the negative sample region and that corresponds to each part is determined as a negative sample grid of the part. Further, an image feature of a positive sample grid region of each part is determined as a positive sample image feature of the part, and an image feature of a negative sample grid region of each part is determined as a negative sample image feature of the part.
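
As a concrete illustration of the intersection-over-union test used to pick negative sample regions, the following sketch compares candidate boxes against a labeled positive sample region; the (x1, y1, x2, y2) box format and the 0.3 preset proportion are assumptions for illustration.

```python
# Sketch of selecting negative sample regions by intersection-over-union
# (IoU) against a labeled positive sample region.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

positive_region = (10, 10, 60, 110)
candidates = [(12, 8, 58, 112), (80, 20, 130, 120), (30, 40, 90, 140)]
negatives = [c for c in candidates if iou(c, positive_region) < 0.3]
print(negatives)  # the two boxes with little or no overlap
```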

In this embodiment of this application, a positive sample image feature and a negative sample image feature that correspond to each part of a target may be determined by using massive sample images. A part identification model with higher part identification precision may be obtained by training a large quantity of positive sample image features and negative sample image features of each part. Therefore, when a local image feature of each part is extracted from the to-be-detected image by using the part identification model, extraction precision of an image feature corresponding to the part can be improved, accuracy of part segmentation of the target is improved, and manners of extracting the local image feature of each part are also diversified.

In some implementations, when the positive sample grid and the negative sample grid corresponding to each part are determined from the N grids of the positive sample region based on the candidate box template, a part grid covered by each part may be determined from the N grids of the positive sample region based on a grid that is in the candidate box template and that is covered by a region in which the part is located. When a part grid covered by any part i includes a part grid j, and a degree at which a region covered by the part i in the part grid j overlaps a region of the part grid j is greater than or equal to a preset threshold, the part grid j is determined as a positive sample grid of the part i. By analogy, a positive sample grid of each part may be determined, where both i and j are natural numbers. Alternatively, when a part grid covered by any part i includes a part grid j, and a degree at which a region covered by the part i in the part grid j overlaps a region of the part grid j is less than a preset threshold, the part grid j is determined as a negative sample grid of the part i. By analogy, a negative sample grid of each part may be determined. Herein, the degree at which the region covered by the part i overlaps the region of the part grid j may be a ratio of an area of a visible region that is of the part i and that is included in the part grid j to an area of the part grid j. The visible region of the part i is a region that is in a positive sample region in the to-be-detected image and that is covered by the part i, and the region covered by the part i may include one or more of the N grids, that is, any part may cover one or more grids in the positive sample region.
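
The grid-level decision can be sketched as follows: the ratio of the visible area of part i inside a grid to the area of that grid is compared with the preset threshold; the binary visibility mask and the 0.5 threshold are illustrative assumptions.

```python
# Sketch of classifying one grid cell as a positive or negative sample
# grid of part i from the visible-area ratio.
import numpy as np

def grid_label(part_mask, grid_box, threshold=0.5):
    """part_mask: (H, W) 0/1 visibility mask of part i; grid_box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = grid_box
    cell = part_mask[y1:y2, x1:x2]
    overlap_ratio = cell.mean()  # visible area of part i / area of the grid
    return "positive" if overlap_ratio >= threshold else "negative"

mask = np.zeros((40, 40), dtype=np.uint8)
mask[0:20, 0:30] = 1  # visible region of part i
print(grid_label(mask, (0, 0, 20, 20)))   # positive: fully covered
print(grid_label(mask, (20, 20, 40, 40))) # negative: no coverage
```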

In this embodiment of this application, a grid covered by each part may be selected from a positive sample image in the to-be-detected image by using the candidate box template, and then an image feature corresponding to the grid may be used as a positive sample image feature used to train the part identification model. In this way, precision of selecting a positive sample image feature is higher, so that interference data in sample data used to train the part identification model may be reduced, thereby improving part identification accuracy of the part identification model obtained through training, and improving target detection accuracy in the to-be-detected image.

In some implementations, when the part identification model is constructed by using the positive sample image feature and the negative sample image feature in the sample image, the positive sample image feature of each part in the sample image and the negative sample image feature of each part may be used as input of the part identification model. A capability of obtaining a local image feature of the part is obtained by using the part identification model and by using a binary classification problem distinguishing between a target part and a background as a learning task. Therefore, this is simple to operate, and the target detection method has higher applicability.
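
A minimal training sketch for one part's identification head, assuming PyTorch, is shown below; the linear head, feature dimension, and optimizer settings are illustrative stand-ins for whatever form the part identification model actually takes.

```python
# Sketch of training a per-part binary classifier (part vs. background)
# on positive and negative sample image features.
import torch
import torch.nn as nn

feat_dim = 256
part_classifier = nn.Linear(feat_dim, 1)          # one part's identification head
optimizer = torch.optim.SGD(part_classifier.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

pos_feats = torch.randn(32, feat_dim)             # positive sample image features
neg_feats = torch.randn(32, feat_dim)             # negative sample image features
features = torch.cat([pos_feats, neg_feats])
labels = torch.cat([torch.ones(32, 1), torch.zeros(32, 1)])

for _ in range(10):                               # a few illustrative steps
    optimizer.zero_grad()
    loss = loss_fn(part_classifier(features), labels)
    loss.backward()
    optimizer.step()
print(float(loss))
```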

In some implementations, when the to-be-detected target in the to-be-detected image is determined based on the part relationship feature, the part relationship feature may be merged with a global image feature, and a first confidence level of each of a category and a location of the to-be-detected target in the to-be-detected image is obtained through learning based on a merged feature. A second confidence level at which the target candidate region includes the to-be-detected target is determined based on the global image feature, and it is determined, based on merging of the first confidence level and the second confidence level, that the to-be-detected image includes the to-be-detected target. Further, a location of the to-be-detected target in the to-be-detected image may be determined based on a location, in the to-be-detected image, of the target candidate region including the to-be-detected target. Herein, when the location of the to-be-detected target in the to-be-detected image is determined, a location, in the to-be-detected image, of a target candidate region that actually includes the to-be-detected target may be determined as the location of the to-be-detected target. This is easy to operate and is highly feasible.

In some implementations, when the local image features of the part candidate regions are learned by using the bidirectional LSTM, the local image features of the part candidate regions may be sorted in a preset part sequence to obtain a sorted feature sequence, and the feature sequence is input to the bidirectional LSTM. Then the relationship between the part candidate regions may be learned by using the bidirectional LSTM and by using a binary classification problem distinguishing between a target and a background as a learning task. In this embodiment of this application, the relationship between the part candidate regions includes at least one of a relationship between the detected target and the part candidate regions, or a dependency relationship between the part candidate regions. The relationship between the detected target and the part candidate regions includes: a relationship that exists between the part candidate regions and a same detected target when the part candidate regions belong to the same detected target, and/or a relationship that exists between each of the part candidate regions and the detected target to which the part candidate region belongs when the part candidate regions belong to different detected targets.

According to a third aspect, an embodiment of this application provides a target detection apparatus, and the apparatus includes units and/or modules that are configured to perform the target detection method provided in any one of the first aspect and/or the possible implementations of the first aspect. Therefore, beneficial effects (or advantages) of the target detection method provided in the first aspect can also be implemented.

According to a fourth aspect, an embodiment of this application provides a target detection apparatus, and the apparatus includes units and/or modules that are configured to perform the target detection method provided in any one of the second aspect and/or the possible implementations of the second aspect. Therefore, beneficial effects (or advantages) of the target detection method provided in the second aspect can also be implemented.

According to a fifth aspect, an embodiment of this application provides a terminal device, and the terminal device includes a memory and a processor. The memory is configured to store a group of program code. The processor is configured to invoke the program code stored in the memory, to perform the target detection method provided in any one of the first aspect and/or the possible implementations of the first aspect. Therefore, beneficial effects of the target detection method provided in the first aspect can also be implemented.

According to a sixth aspect, an embodiment of this application provides a computer device, and the computer device may be a terminal device or another type of computer device. The computer device includes a memory and a processor, and may further include an input/output device, a communications interface, and the like. The memory is configured to store a group of program code. The processor is configured to invoke the program code stored in the memory, to perform the target detection method provided in any one of the second aspect and/or the possible implementations of the second aspect. Therefore, beneficial effects of the target detection method provided in the second aspect can also be implemented.

According to a seventh aspect, an embodiment of this application provides a computer readable storage medium, and the computer readable storage medium stores an instruction. When the instruction is run on a computer, the computer performs the target detection method provided in any one of the first aspect and/or the possible implementations of the first aspect, and beneficial effects of the target detection method provided in the first aspect can also be implemented.

According to an eighth aspect, an embodiment of this application provides a computer readable storage medium, and the computer readable storage medium stores an instruction. When the instruction is run on a computer, the computer performs the target detection method provided in any one of the second aspect and/or the possible implementations of the second aspect, and beneficial effects of the target detection method provided in the second aspect can also be implemented.

According to a ninth aspect, an embodiment of this application provides a target detection apparatus. The target detection apparatus may be a chip or a plurality of chips working in cooperation. The target detection apparatus includes an input device coupled to the target detection apparatus, and the target detection apparatus is configured to perform the technical solution provided in the first aspect of the embodiments of this application.

According to a tenth aspect, an embodiment of this application provides a target detection apparatus. The target detection apparatus may be a chip or a plurality of chips working in cooperation. The target detection apparatus includes an input device coupled to the target detection apparatus, and the target detection apparatus is configured to perform the technical solution provided in the second aspect of the embodiments of this application.

According to an eleventh aspect, an embodiment of this application provides a target detection system. The target detection system includes a processor, configured to support a target detection apparatus in implementing a function in the first aspect, for example, generating or processing information in the target detection method provided in the first aspect. In a possible design, the target detection system further includes a memory, and the memory is configured to store a program instruction and data that are necessary for the target detection apparatus. The target detection system may include a chip, or may include a chip and another discrete part.

According to a twelfth aspect, an embodiment of this application provides a target detection system. The target detection system includes a processor, configured to support a target detection apparatus in implementing a function in the second aspect, for example, generating or processing information in the target detection method provided in the second aspect. In a possible design, the target detection system further includes a memory, and the memory is configured to store a program instruction and data that are necessary for the target detection apparatus. The target detection system may include a chip, or may include a chip and another discrete part.

According to a thirteenth aspect, an embodiment of this application provides a computer program product including an instruction. When the computer program product is run on a computer, the computer performs the target detection method provided in the first aspect, and beneficial effects of the target detection method provided in the first aspect can also be implemented.

According to a fourteenth aspect, an embodiment of this application provides a computer program product including an instruction. When the computer program product is run on a computer, the computer performs the target detection method provided in the second aspect, and beneficial effects of the target detection method provided in the second aspect can also be implemented.

In the embodiments of this application, image data processing precision in target detection can be improved, application scenarios of target detection can be diversified, and target detection accuracy can be improved. Therefore, the target detection method and apparatus have higher applicability.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic diagram of a system architecture to which a target detection method according to an embodiment of this application is applicable;

FIG. 2 is a schematic structural diagram of a target detection apparatus according to an embodiment of this application;

FIG. 3 is a schematic flowchart of a target detection method according to an embodiment of this application;

FIG. 4 is a schematic diagram of a pedestrian candidate region according to an embodiment of this application;

FIG. 5 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application;

FIG. 6A is a schematic diagram of a pedestrian part according to an embodiment of this application;

FIG. 6B is a schematic diagram of a pedestrian part according to an embodiment of this application;

FIG. 7 is a schematic diagram of a to-be-detected image processing procedure in a pedestrian detection method according to an embodiment of this application;

FIG. 8 is another schematic flowchart of a target detection method according to an embodiment of this application;

FIG. 9 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application;

FIG. 10 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application;

FIG. 11 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application;

FIG. 12 is another schematic structural diagram of a target detection apparatus according to an embodiment of this application; and

FIG. 13 is another schematic structural diagram of a target detection apparatus according to an embodiment of this application.

DESCRIPTION OF EMBODIMENTS

A target detection method and apparatus provided in the embodiments of this application are applicable to detected targets that include but are not limited to a pedestrian, an animal, a vehicle, an article carried around by people, and the like. This is not limited herein. Detection of the targets such as the pedestrian or the animal further includes detection of different postures of the pedestrian or the animal, or detection performed when a part of a body of the pedestrian or the animal is blocked. This is not limited herein. The article carried around by the pedestrian may include a portable controlled tool of concern in a safe city, and the like. This is not limited herein. For ease of description, in a subsequent description of the embodiments of this application, a to-be-detected target (or referred to as a target) is described by using a pedestrian as an example.

Currently, in the field of pedestrian detection, a deep learning-based pedestrian detection method based on a deep neural network model is one of the main methods for pedestrian detection. In the deep learning-based pedestrian detection method based on the deep neural network model, a feature extraction network model is first constructed, so as to extract features from a pedestrian region and a background region in a detected image by using the feature extraction network model. Then, a pedestrian detection problem is converted into a classification problem (for ease of description, in the following, the classification problem is briefly referred to as a binary classification problem used to determine whether a pedestrian is included or a pedestrian is not included (that is, a background is included), where the two cases are counted as two classes, or is briefly referred to as a binary classification problem distinguishing between a pedestrian and a background) used to determine whether a region includes a pedestrian (a region including a pedestrian is classified as a pedestrian region, and a region including no pedestrian is classified as a background region), and a regression problem used to determine a specific location (used to indicate a specific location of a pedestrian in the detected image), in the detected image, of a region including the pedestrian, so as to design an optimization function. Further, the feature extraction network model may be trained with reference to a large amount of sample image data used for pedestrian detection, to obtain parameters of the following parts in the feature extraction network model: a feature extraction part, a classification part, and a regression (that is, locating) part. When a to-be-detected image is input to the feature extraction network model, the feature extraction part in the feature extraction network model is first used to extract an image feature, and extract a pedestrian candidate region based on the feature. Then the classification part in the feature extraction network model is used to determine whether each pedestrian candidate region includes a pedestrian, and the regression part in the feature extraction network model is used to determine a specific location, in the to-be-detected image, of the pedestrian in each pedestrian candidate region including the pedestrian, so as to complete pedestrian target detection in the to-be-detected image.
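
The classification-plus-regression design described above can be sketched as a two-head network, assuming PyTorch; the layer sizes, the dummy data, and the simple sum of losses standing in for the optimization function are illustrative assumptions.

```python
# Sketch of the classification (pedestrian vs. background) and
# regression (location) heads sharing one extracted region feature.
import torch
import torch.nn as nn

class PedestrianHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cls = nn.Linear(feat_dim, 2)   # pedestrian vs. background
        self.reg = nn.Linear(feat_dim, 4)   # box location (x, y, w, h)

    def forward(self, region_feat):
        return self.cls(region_feat), self.reg(region_feat)

head = PedestrianHead()
feat = torch.randn(8, 256)                  # features of 8 candidate regions
cls_logits, boxes = head(feat)
# Joint optimization function: classification loss + regression loss.
loss = (nn.functional.cross_entropy(cls_logits, torch.randint(0, 2, (8,)))
        + nn.functional.smooth_l1_loss(boxes, torch.randn(8, 4)))
print(float(loss))
```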

The embodiments of this application provide a pedestrian detection method and apparatus (namely, a target detection method and apparatus, where a target detection method and apparatus in which a pedestrian is used as a target is used as an example) in which an overall image feature of a pedestrian is merged with a local image feature of a pedestrian part. In the pedestrian detection method and apparatus provided in the embodiments of this application, a deep learning framework is used to detect a pedestrian, and a local image feature of a pedestrian part is obtained based on the deep learning framework by using an image segmentation network, so that pedestrian part division is more accurate and a quantity of pedestrian parts may be flexibly adjusted according to an actual situation, extraction precision of a local image feature of a pedestrian part is higher, and operations are more flexible. In addition, in the pedestrian detection method provided in the embodiments of this application, a long short-term memory (LSTM) network is used to learn a relationship between pedestrian parts, to obtain a feature (for ease of description, a part relationship feature may be used below for description) used to describe the relationship between the pedestrian parts. Finally, multi-task learning is used to mine a correlation between an overall image feature of a pedestrian and a local image feature of a pedestrian part, to efficiently share a feature, so that a pedestrian detection rate can be improved in a complex pedestrian detection scenario, especially in a pedestrian detection scenario in which a pedestrian part is blocked seriously, thereby accurately identifying a pedestrian and a specific location of the pedestrian. Optionally, in the embodiments of this application, pedestrian parts may include a head, a trunk, a left arm, a left hand, a right arm, a right hand, a left leg, a left foot, a right leg, a right foot, and the like. This is not limited herein.

FIG. 1 is a schematic diagram of a system architecture to which a target detection method according to an embodiment of this application is applicable. The target detection method provided in this embodiment of this application is applicable to a target detection system 10, for example, a search-by-image system. The target detection system 10 includes but is not limited to processing modules such as a processor 11, a memory 12, a communications interface 13, an input device 14, and a display 15. The modules such as the processor 11, the memory 12, the communications interface 13, the input device 14, and the display 15 may be connected by using a communications bus. This is not limited herein. The communications interface 13 is configured to communicate with a network element, to establish a communication connection between the target detection system 10 and the network element. The input device 14 is configured to input to-be-processed data such as a surveillance video. The memory 12 may be configured to store data such as an operating system, an application program, and a pedestrian detection algorithm. The processor 11 is configured to perform the pedestrian detection algorithm to implement pedestrian detection on the to-be-processed data. The display 15 may be configured to display a pedestrian detection result. The memory 12 is further configured to store the pedestrian detection result for the to-be-processed data.

The memory 12 is further configured to store a program. Specifically, the program may include program code, and the program code includes a computer operation instruction. The memory 12 includes but is not limited to a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM). Only one memory is shown in FIG. 1. Certainly, a plurality of memories may be disposed according to a requirement.

The memory 12 may be alternatively a memory in the processor 11. This is not limited herein.

The processor 11 may be one or more central processing units (CPU). When the processor 11 is one CPU, the CPU may be a single-core CPU, or may be a multi-core CPU. The processor 11 may be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware part; and may implement or perform the target detection method provided in the embodiments of this application. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The input device 14 may include a surveillance camera, a camera of a wireless terminal, or the like. This is not limited herein. The wireless terminal may be a handheld device with a radio connection function, or another processing device connected to a radio modem, and may be a mobile terminal that communicates with one or more core networks by using a radio access network. For example, the wireless terminal may be a mobile phone, a computer, a tablet computer, a personal digital assistant (PDA), a mobile internet device (MID), a wearable device, or an e-book reader. For another example, the wireless terminal may be a portable, pocket-sized, handheld, computer built-in, or vehicle-mounted mobile device.

The input device 14 is configured to input to-be-detected data, for example, to-be-processed data such as a surveillance video and image (for ease of description, in this embodiment of this application, a to-be-detected image is used as an example for description). After a surveillance video is input to the target detection system 10, the processor 11 may be configured to perform a pedestrian detection method provided in the embodiments of this application, to detect a pedestrian in the surveillance video. After completing detection on the pedestrian in the surveillance video, the processor 11 may display a detection result on the display for a user to view. In addition, after the pedestrian in the surveillance video is detected, data in a database constructed with reference to subsequent algorithms such as pedestrian tracking and pedestrian image feature extraction may be further stored in the memory for subsequent query. An implementation of the pedestrian detection method performed by the processor is described below with reference to FIG. 1. The pedestrian detection method provided in the embodiments of this application is only an example of the target detection method in which a pedestrian is used as a target in the embodiments of this application. Specifically, the target detection method provided in the embodiments of this application may also be used to detect a target such as an animal or a vehicle. This is not limited herein. For ease of description, the pedestrian detection method is used as an example below for description, that is, a target to be detected in the target detection method is described by using a pedestrian as an example.

In some implementations, the pedestrian detection method provided in the embodiments of this application may be performed by a target detection apparatus, for example, the processor 11 in the target detection system 10. This is not limited herein.

FIG. 2 is a schematic structural diagram of a target detection apparatus according to an embodiment of this application. In this embodiment of this application, the target detection apparatus includes but is not limited to a feature extraction unit 111, a target candidate region extraction unit 112, an image segmentation unit 113, a part relationship learning unit 114, and a target prediction unit 115.

The feature extraction unit 111 is configured to obtain an image feature, in a to-be-detected image, for which a pedestrian is used as a detected target.

The target candidate region extraction unit 112 is configured to extract all possible pedestrian candidate regions from the to-be-detected image based on the image feature extracted by the feature extraction unit 111 for the detected target, namely, the pedestrian. In this embodiment of this application, the pedestrian candidate region may be specifically a candidate rectangular box region of the pedestrian. This is not limited herein.

The feature extraction unit 111 is further configured to extract, from the image feature extracted from the to-be-detected image, an overall image feature corresponding to the pedestrian candidate region. The overall image feature corresponding to the pedestrian candidate region may be an image feature extracted by using the pedestrian as a whole in the pedestrian candidate region. This is not limited herein.

The image segmentation unit 113 is configured to extract, by using an image segmentation network from each pedestrian candidate region extracted by the target candidate region extraction unit 112, a local image feature that corresponds to each part and that is segmented for the pedestrian part.

The image segmentation unit 113 is further configured to perform, based on the extracted local image feature corresponding to each part, part segmentation in the pedestrian candidate region extracted by the target candidate region extraction unit 112, to obtain a part candidate region of the pedestrian. There may be one or more part candidate regions obtained by dividing the pedestrian candidate region. A plurality of part candidate regions is used as an example for description in this embodiment of this application.

The part relationship learning unit 114 is configured to learn, by using a bidirectional LSTM, a relationship between part candidate regions obtained by the image segmentation unit 113 through segmentation, to obtain a part relationship feature used to describe the relationship between the part candidate regions. In this embodiment of this application, the bidirectional LSTM is used to learn the relationship between part candidate regions, so that not only an obvious location relationship between the part candidate regions can be learned, but also an implied part relationship between the part candidate regions can be analyzed. For example, in a pedestrian candidate region, there are pedestrian parts such as an arm, a head, and a trunk, but the head belongs to one pedestrian, and the arm and the trunk belong to another pedestrian. To be specific, when different pedestrians block each other, a bidirectional LSTM model may determine, to an extent, whether the pedestrian candidate region includes one pedestrian or a plurality of pedestrians.

Optionally, in this embodiment of this application, the bidirectional LSTM is used to learn the relationship between part candidate regions, so that not only a relationship between the detected target and the part candidate regions can be learned, but also a dependency relationship between the part candidate regions can be learned. The relationship between the detected target and the part candidate regions may include the following. The part candidate regions belong to a same detected pedestrian, or the part candidate regions belong to different detected pedestrians. In this embodiment of this application, when the bidirectional LSTM is used to determine the relationship between the part candidate regions, if the part candidate regions belong to a same detected pedestrian, a relationship between the same detected pedestrian and the part candidate regions may be learned. Optionally, in this embodiment of this application, when the bidirectional LSTM is used to determine the relationship between the part candidate regions, if the part candidate regions belong to different detected pedestrians, a relationship between each of the part candidate regions and a detected pedestrian to which the part candidate region belongs may be further learned. For example, a plurality of part candidate regions separately belong to a head candidate region, a trunk candidate region, a leg candidate region, and the like of a detected pedestrian. Alternatively, a plurality of part candidate regions separately belong to different detected pedestrians, and include a head candidate region of a detected pedestrian 1, a trunk candidate region of a detected pedestrian 2, a leg candidate region of the detected pedestrian 1, and the like. The relationship between part candidate regions is merely an example, and may be specifically determined according to a posture change or a blocking status of a to-be-detected pedestrian in a to-be-detected image in an actual application scenario. This is not limited herein.

The target prediction unit 115 is configured to merge, at a feature layer, the overall image feature extracted by the feature extraction unit 111 with the part relationship feature obtained by the part relationship learning unit 114 through learning, to obtain a merged feature. For example, the overall image feature is merged with the part relationship feature through series connection, and the merged feature is sent to a local classifier, to obtain a score based on the local classifier. The score of the local classifier indicates a probability, determined by the local classifier based on the input feature, that a pedestrian is included. Optionally, the target prediction unit 115 is further configured to: send the overall image feature to an overall classifier used for pedestrian detection, to obtain a score based on the overall classifier; and merge, at a classifier layer, the score of the overall classifier used for the pedestrian detection with the score of the local classifier, to obtain a pedestrian detection result for detection of the pedestrian in the to-be-detected image. In the implementation in which the final pedestrian detection result is obtained by merging the score of the overall classifier for pedestrian detection with the score of the local classifier, interference caused by a part division error can be avoided, thereby improving accuracy of pedestrian detection.
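
Both merging steps can be sketched as follows, assuming PyTorch: series connection (concatenation) at the feature layer, then score fusion at the classifier layer; the dimensions and the averaging rule are illustrative assumptions.

```python
# Sketch of feature-layer merging (concatenation) followed by
# classifier-layer score fusion in the target prediction unit.
import torch
import torch.nn as nn

overall_dim, relation_dim = 256, 128
local_classifier = nn.Linear(overall_dim + relation_dim, 1)
overall_classifier = nn.Linear(overall_dim, 1)

overall_feat = torch.randn(1, overall_dim)        # overall image feature
relation_feat = torch.randn(1, relation_dim)      # part relationship feature

merged = torch.cat([overall_feat, relation_feat], dim=1)  # feature-layer merge
local_score = torch.sigmoid(local_classifier(merged))
overall_score = torch.sigmoid(overall_classifier(overall_feat))
final_score = (local_score + overall_score) / 2   # classifier-layer merge
print(float(final_score))
```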

A specific implementation of the pedestrian detection method provided in the embodiments of this application is described below with reference to the target detection apparatus.

Embodiment 1

FIG. 3 is a schematic flowchart of a target detection method according to an embodiment of this application. The target detection method provided in this embodiment of this application may include the following steps.

S301. Obtain a target candidate region in a to-be-detected image.

In some implementations, after the to-be-detected image is input to the target detection system 10 by using the input device 14, the feature extraction unit 111 extracts, from the to-be-detected image, an image feature for which a pedestrian is used as a detected object. In a deep learning-based pedestrian detection method based on a deep neural network model, a deep feature of the to-be-detected image may be first extracted by using a convolutional neural network (CNN), and then a local region candidate box is extracted from the to-be-detected image by using a region proposal network (RPN) based on the deep feature extracted by the convolutional neural network, for example, an external rectangular box of a pedestrian that may include the pedestrian. Optionally, when extracting the image feature from the to-be-detected image, the feature extraction unit 111 may first use an original network model in a deep learning framework as initialization of a target network model used for pedestrian detection. Then the feature extraction unit 111 may replace a classification problem of the original network model with a classification problem used to determine whether a region is a pedestrian region including a pedestrian or a background region that does not include a pedestrian (that is, a binary classification problem used to determine whether a pedestrian is included or a pedestrian is not included (that is, a background is included), where the two cases are counted as two classes, and this is briefly referred to as a binary classification problem distinguishing between a pedestrian and a background), and train the original network model with reference to a sample image (or a sample data set) used for pedestrian detection, to construct the target network model used for pedestrian detection. The constructed target network model used for pedestrian detection may be a convolutional neural network, so that the constructed convolutional neural network can be better adapted to a pedestrian detection task. For example, a visual geometry group network (VGG Net) obtained by training an ImageNet data set may be first selected as an original network model used for training, and then a 1000-class classification problem in the original ImageNet data set is replaced with the binary classification problem distinguishing between a pedestrian and a background. The VGG Net is trained with reference to the sample image used for pedestrian detection. An existing network model framework of the VGG Net is used to initialize the VGG Net, and the sample image used for pedestrian detection is used to train, for the existing network model framework, a function of distinguishing between a pedestrian and a background. A network parameter of the VGG Net is adjusted by training the VGG Net, so that the network parameter of the VGG Net is a network parameter applicable to pedestrian detection. This process may be referred to as a process in which the VGG Net model is finely adjusted to construct a network model used for pedestrian detection.
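
The fine-tuning procedure can be sketched with torchvision as follows: an ImageNet-pretrained VGG net is used for initialization, its 1000-class classifier head is replaced with the two-class pedestrian-versus-background head, and the network is then trained on pedestrian sample images; the optimizer settings and dummy batch are illustrative.

```python
# Sketch of fine-tuning an ImageNet-pretrained VGG net for the binary
# pedestrian-vs-background classification problem, assuming torchvision.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(pretrained=True)          # initialization from ImageNet
vgg.classifier[6] = nn.Linear(4096, 2)       # 1000-class head -> 2 classes

optimizer = torch.optim.SGD(vgg.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One illustrative fine-tuning step on a dummy pedestrian sample batch.
images = torch.randn(4, 3, 224, 224)
labels = torch.tensor([1, 0, 1, 1])          # 1 = pedestrian, 0 = background
optimizer.zero_grad()
loss = loss_fn(vgg(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```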

Optionally, the original network model in the deep learning framework may further include network models such as AlexNet, GoogLeNet, and ResNet. Specifically, the original network model may be determined according to a requirement in an actual application scenario. This is not limited herein.

Optionally, after constructing the convolutional neural network used for pedestrian detection, the feature extraction unit 111 may extract, from the to-be-detected image by using the convolutional neural network, the image feature for which a pedestrian is used as a detected object. Optionally, the image feature may be a deep feature at the last convolution layer of the convolutional neural network, and the image feature may be an image feature used to describe whether the to-be-detected image includes a pedestrian, and is an image feature extracted by using the pedestrian as a whole. Therefore, for ease of description, the image feature is also referred to as an overall image feature.

In some implementations, the target candidate region extraction unit 112 extracts a target candidate region (namely, a pedestrian candidate region; for ease of description, the pedestrian candidate region is used as an example below for description) for which a pedestrian is used as a detected object from the to-be-detected image. In a process of extracting the pedestrian candidate region, the target candidate region extraction unit 112 may enumerate pedestrian candidate regions, in a to-be-detected image in an actual pedestrian detection application scenario, that may include a pedestrian by learning real given features (for example, some image features used to express what the pedestrian looks like) of the pedestrian. FIG. 4 is a schematic diagram of a pedestrian candidate region according to an embodiment of this application. In this embodiment of this application, the pedestrian candidate regions in the to-be-detected image that are enumerated by the target candidate region extraction unit 112 include a region that actually includes a pedestrian, and may also include a region that may include a pedestrian but actually does not include the pedestrian. As shown in FIG. 4, the target candidate region extraction unit 112 may extract numerous pedestrian candidate regions from the to-be-detected image, such as the regions in the densely distributed rectangular boxes in FIG. 4, including regions that actually include a pedestrian, such as a region 1, a region 2, a region 3, and a region 4 in white rectangular boxes. In this embodiment of this application, the pedestrian candidate region may be specifically an image region corresponding to a candidate rectangular box of a pedestrian. This is not limited herein. For ease of description, an example in which the pedestrian candidate region is a region that actually includes a pedestrian may be used below for description. FIG. 5 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application. The target candidate region extraction unit 112 may extract, from the to-be-detected image, an external rectangular box of a pedestrian that may include the pedestrian, for example, an external rectangular box (namely, a rectangular box 1) of a pedestrian 1 and an external rectangular box (namely, a rectangular box 2) of a pedestrian 2; and determine an image region corresponding to the rectangular box 1 and an image region corresponding to the rectangular box 2 as target candidate regions in which a pedestrian is used as a detected object.

Optionally, the target candidate region extraction unit 112 may firstobtain an initial RPN model. The initial RPN model has an initialnetwork parameter, and the initial RPN model may frame a foregroundregion or a background from an image by using the initial networkparameter of the initial RPN model. The target candidate regionextraction unit 112 may initialize the initial RPN model by using anetwork model framework of the initial RPN model, use a pedestriansample image used for pedestrian detection to train, for the initial RPNmodel, a function of using a pedestrian as a detected object. A networkparameter of the initial RPN model is adjusted by training the initialRPN model, so that a network parameter of a trained RPN model is anetwork parameter applicable to framing of a pedestrian or a background.This process may be referred to as a process in which the initial RPNmodel is finely adjusted to construct a network model applicable topedestrian detection. The RPN model obtained through training may be atarget PRN model used to implement a function of extracting a region inwhich a pedestrian is used as a detected object. The network parameterof the target PRN model may be obtained by training a pedestrian sampleimage for which a pedestrian is used as a detected object, and thereforethe target PRN model may be better applicable to pedestrian detection.After the target RPN model is constructed, a plurality of rectangularbox regions that may include a pedestrian may be determined from theto-be-detected image with reference to a window sliding method. Forexample, by using a window sliding method and with reference to the deepfeature (namely, the image feature extracted by using a pedestrian as awhole) extracted from the last convolution layer of the convolutionalneural network by using a convolution core of the convolutional neuralnetwork, frame-by-frame image feature sliding is performed on the deepfeature, and a confidence level at which an image feature slid each timeincludes a feature of the pedestrian, namely, the detected object, iscalculated in a sliding process. The confidence level at which the imagefeature slid each time includes the feature of the pedestrian, namely,the detected object, is a probability that the image feature includesthe pedestrian, and a higher confidence level indicates a higherprobability that the image feature includes the pedestrian. Theextracted deep feature is slid by using the sliding window method, sothat a plurality of candidate regions may be determined from theto-be-detected image (for ease of description, rectangular box regionsare used as an example below for description). An image in a region in arectangular box (which may be referred to as a rectangular box regionfor ease of description) is a candidate box feature map. After theplurality of rectangular box regions are determined from theto-be-detected image, a candidate box feature map may be selected fromcandidate box feature maps corresponding to the rectangular box regions,where a confidence level at which the selected candidate box feature mapincludes a pedestrian image feature is greater than or equal to a presetthreshold. A candidate region corresponding to the selected candidatebox feature map is used as a pedestrian candidate region (namely, thetarget candidate region), for example, an external rectangular box ofthe pedestrian. 
In this embodiment of this application, a confidence level at which a pedestrian candidate region includes a pedestrian may also be understood as a possibility that the to-be-detected image has a detected target such as the pedestrian in this region; it may indicate not only a possibility that the to-be-detected image includes the pedestrian, but also a possibility that the pedestrian is in this region. This is not limited herein. This region corresponds to a location of the pedestrian in the candidate box feature map in the process of performing frame-by-frame image feature sliding on the deep feature. The location may also be restored to a corresponding location in the to-be-detected image, so that a location of the pedestrian in the to-be-detected image may be determined.
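For illustration, the sliding-window candidate selection described above can be sketched as follows. This is a minimal sketch only: the window size, stride, threshold, and the score_fn scoring function (standing in for the trained RPN head) are assumptions introduced here, not part of the original method.

```python
import numpy as np

def select_candidate_regions(feature_map, window, stride, score_fn, threshold):
    """Slide a window over a deep feature map and keep windows whose
    pedestrian confidence is at least `threshold`. `score_fn` is a
    hypothetical stand-in for the trained network head."""
    H, W, _ = feature_map.shape
    wh, ww = window
    candidates = []
    for y in range(0, H - wh + 1, stride):
        for x in range(0, W - ww + 1, stride):
            crop = feature_map[y:y + wh, x:x + ww, :]
            confidence = score_fn(crop)  # probability the crop contains a pedestrian
            if confidence >= threshold:
                # record the window location so it can later be mapped back
                # to the corresponding region of the to-be-detected image
                candidates.append(((y, x, wh, ww), confidence))
    return candidates
```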

Optionally, the image feature corresponding to the pedestrian candidate region may be a global image feature extracted by using a pedestrian as a detected target, that is, an image feature of a candidate rectangular box region of the pedestrian, for example, an image feature of the region in the rectangular box 1 shown in FIG. 5. In this embodiment of this application, the global image feature may also be referred to as an overall image feature. The overall image feature may be a feature used to indicate a pedestrian image in a candidate rectangular box region of a pedestrian, that is, an image feature extracted by using the pedestrian as a whole; it corresponds to the target candidate region, namely, the candidate rectangular box region of the pedestrian. For ease of description, the overall image feature is used as an example below. In contrast to the image feature extracted by using a pedestrian as a whole, an image feature extracted for a pedestrian part, that is, a local pedestrian detail, may be referred to as a local image feature. The overall image feature may be merged with a part relationship feature used to describe a relationship between parts of a pedestrian, to determine whether the to-be-detected image includes a pedestrian and a location of the included pedestrian in the to-be-detected image (for example, a location, in the to-be-detected image, of a candidate rectangular box region that includes the pedestrian).

S302. Determine, by using an image segmentation network from the target candidate region, part candidate regions respectively corresponding to at least two parts, where each part candidate region corresponds to one part of a to-be-detected target; and extract, from the to-be-detected image, local image features corresponding to the part candidate regions.

In some implementations, after determining the pedestrian candidate region in the to-be-detected image, the target candidate region extraction unit 112 may perform pedestrian part segmentation in the pedestrian candidate region by using the image segmentation unit 113, so as to determine, from the pedestrian candidate region, a part candidate region corresponding to each pedestrian part. The pedestrian candidate region may include a region that actually includes a pedestrian, and may also include a region that may include a pedestrian but actually does not. This is not limited herein. For ease of description, an example in which the pedestrian candidate region is a region that actually includes a pedestrian is used below. FIG. 6A is a schematic diagram of a pedestrian part according to an embodiment of this application. With reference to FIG. 5, in the rectangular box 1 shown in FIG. 5, the visible pedestrian parts of the pedestrian 1 include a head, a left hand, a left arm, a trunk, a left leg, a left foot, a right leg, and a right foot shown in FIG. 6A. For the pedestrian 1, a region corresponding to any one of these pedestrian parts in FIG. 6A is a part candidate region corresponding to that pedestrian part. For another example, FIG. 6B is a schematic diagram of a pedestrian part according to an embodiment of this application. In the rectangular box 2 shown in FIG. 5, because the pedestrian 1 and the pedestrian 2 block each other, the visible pedestrian parts of the pedestrian 2 include a head, a trunk, a right leg, a right foot, and a left foot shown in FIG. 6B. For the pedestrian 2, a region corresponding to any one of these pedestrian parts in FIG. 6B is a part candidate region corresponding to that pedestrian part of the pedestrian 2.

Optionally, if the pedestrian candidate region is a region that may include a pedestrian but actually does not, the image segmentation unit 113 may perform pedestrian part segmentation in the pedestrian candidate region through predictive determining, so as to determine, from the pedestrian candidate region, a part candidate region corresponding to each pedestrian part. This may be specifically determined according to an actual application scenario, and is not limited herein.

In some implementations, the image segmentation unit 113 may extract, from the to-be-detected image, the local image feature corresponding to each part candidate region, so as to obtain a local image feature corresponding to each pedestrian part. One pedestrian part candidate region corresponds to one group of local image features, that is, one pedestrian part corresponds to one group of local image features. Optionally, the image segmentation unit 113 may construct an image segmentation network used for pedestrian detection, and obtain pedestrian parts by using the image segmentation network, so that pedestrian part division is finer and the quantity of pedestrian parts may be flexibly adjusted according to an actual application scenario, thereby more accurately capturing a posture change or a blocking status of a pedestrian.

Optionally, the image segmentation network provided in this embodiment of this application may be a fully convolutional network (FCN). When the fully convolutional network is used to extract, from the to-be-detected image, the local image feature corresponding to each part, the to-be-detected image may first be input to the fully convolutional network, which outputs, for each pixel in the to-be-detected image, a prediction result of the pedestrian part to which the pixel corresponds. The prediction results of the pixels may then be divided by using the fully convolutional network, to obtain a part candidate region corresponding to each part: pixels whose prediction results belong to a same part are grouped into the part candidate region corresponding to that part. A part candidate region represents a candidate region of one part of one pedestrian, and therefore part candidate regions of a plurality of parts of a plurality of pedestrians may be obtained. An image feature extracted from each part candidate region may be used as the local image feature corresponding to the pedestrian part indicated by that part candidate region. If a plurality of part candidate regions are extracted from the to-be-detected image, an image feature of each part candidate region may be extracted and used as the local image feature corresponding to each pedestrian part. For example, in FIG. 6A, the head of the pedestrian 1 corresponds to a part candidate region (for ease of description, it may be assumed to be a part candidate region 11), and an image feature of the part candidate region 11 is the local image feature corresponding to the head of the pedestrian 1. Likewise, an image feature extracted from a part candidate region corresponding to another pedestrian part of the pedestrian 1 is the local image feature corresponding to that part, and an image feature extracted from a part candidate region corresponding to each pedestrian part of the pedestrian 2 is the local image feature corresponding to that pedestrian part of the pedestrian 2. When pedestrian parts are identified and divided by using the fully convolutional network, a pixel-level image feature may be identified, to obtain a pixel-level local image feature. This has higher division precision than conventional rectangular box part division, and is therefore better fitted to a complex and variable actual scenario, so the target detection method has higher applicability.
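As an illustration of grouping the FCN's pixel-level predictions into part candidate regions, the following sketch assumes the network output has already been reduced to an (H, W) map of per-pixel part indices with 0 reserved for background; this array layout is an assumption made for illustration.

```python
import numpy as np

def part_candidate_regions(part_label_map):
    """Group pixels assigned to the same part into one part candidate
    region, returned as a tight bounding box plus a pixel-level mask."""
    regions = {}
    for part_id in np.unique(part_label_map):
        if part_id == 0:          # skip background pixels
            continue
        mask = part_label_map == part_id
        ys, xs = np.nonzero(mask)
        regions[int(part_id)] = {
            "bbox": (ys.min(), xs.min(), ys.max(), xs.max()),
            "mask": mask,         # pixel-level region, finer than a rectangle
        }
    return regions
```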

Optionally, in this embodiment of this application, pedestrian parts may include a head, a left arm, a right arm, a left hand, a right hand, a trunk (for example, an upper body), a left leg, a right leg, a left foot, a right foot, and the like. This is not limited herein. The to-be-detected image may include one or more of the pedestrian parts. The image segmentation unit 113 may use the fully convolutional network to extract, from the to-be-detected image, a local image feature corresponding to each part, so as to determine a relationship between the pedestrian parts by using the local image features.

S303. Learn the local image features of the part candidate regions by using a bidirectional LSTM, to obtain a part relationship feature used to describe a relationship between the part candidate regions.

In some implementations, after obtaining the local image feature corresponding to each pedestrian part, the image segmentation unit 113 may learn a relationship between pedestrian parts by using the part relationship learning unit 114. Optionally, the part relationship learning unit 114 sorts the local image features of the part candidate regions in a preset part sequence to obtain a sorted feature sequence, and inputs the feature sequence to the bidirectional LSTM. The preset part sequence may be: a head, a left arm, a right arm, a left hand, a right hand, an upper body, a left leg, a right leg, a left foot, and a right foot. This may be specifically determined according to an actual application scenario requirement, and is not limited herein. The part relationship learning unit 114 may learn the relationship between the part candidate regions by using the bidirectional LSTM and by using, as a learning task, a binary classification problem distinguishing between a pedestrian and a background. In other words, the bidirectional LSTM may obtain a relationship between regions through learning, and the specific type of region depends on the actual application scenario: in a pedestrian detection application scenario, the relationship between the regions is specifically the relationship between the part candidate regions corresponding to pedestrian parts. Therefore, when the relationship between the regions is learned by using the bidirectional LSTM, a learning objective further needs to be set for the bidirectional LSTM, that is, the binary classification problem of determining whether a part candidate region is a pedestrian or a background is used as the learning objective. The relationship between the part candidate regions of the pedestrian parts may then be obtained by the bidirectional LSTM through learning.
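The following PyTorch sketch shows one way to serialize the per-part local features in the preset part sequence and pass them through a bidirectional LSTM with a pedestrian-versus-background head. The feature dimension, hidden size, and the uniform (mean) merging over the sequence are illustrative assumptions, not the specified configuration.

```python
import torch
import torch.nn as nn

# Preset part order used to serialize the per-part local features,
# mirroring the sequence given in the text.
PART_ORDER = ["head", "left_arm", "right_arm", "left_hand", "right_hand",
              "upper_body", "left_leg", "right_leg", "left_foot", "right_foot"]

class PartRelationLSTM(nn.Module):
    """Minimal sketch: a bidirectional LSTM over the ordered part features,
    trained with a pedestrian-vs-background binary objective."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # pedestrian vs background

    def forward(self, part_feats):
        # part_feats: (batch, num_parts, feat_dim), ordered by PART_ORDER
        seq, _ = self.lstm(part_feats)             # part relationship features
        logits = self.classifier(seq.mean(dim=1))  # merge over the sequence
        return seq, logits
```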

Optionally, the relationship between the part candidate regions includes a relationship between the to-be-detected target and the part candidate regions, and a dependency relationship between the part candidate regions. The relationship between the to-be-detected target and the part candidate regions includes: a relationship that is between a same to-be-detected target to which the part candidate regions belong and the part candidate regions and that exists when the part candidate regions belong to the same to-be-detected target, and/or a relationship that is between each of the part candidate regions and the to-be-detected target to which the part candidate region belongs and that exists when the part candidate regions belong to different to-be-detected targets. For example, when the part candidate regions belong to a same to-be-detected target, the relationship indicates which part of the detected target, such as a head, a trunk, or an arm, each of the part candidate regions corresponds to. The dependency relationship between the part candidate regions may include a connection relationship between the parts corresponding to the part candidate regions. For example, a head is connected to a trunk, a left arm is connected to the left of the trunk, and a right arm is connected to the right of the trunk. This is not limited herein.

Optionally, the part relationship learning unit 114 may model and learn the relationship between the part candidate regions based on the local image feature that corresponds to each pedestrian part and that is extracted by the image segmentation unit 113, and extract a feature that can describe the relationship between the pedestrian parts. For example, it is assumed that the pedestrian parts include 10 parts: a head, a left arm, a right arm, a left hand, a right hand, a trunk, a left leg, a right leg, a left foot, and a right foot. The part relationship learning unit 114 may first construct a bidirectional LSTM model, combine the local image features that correspond to the parts and that are extracted by the image segmentation unit 113 to obtain a feature sequence, input the feature sequence to the constructed bidirectional LSTM model, and learn a relationship between the pedestrian parts by using the bidirectional LSTM model and by using, as a learning objective, a binary classification problem distinguishing between a pedestrian and a background.

In some implementations, the bidirectional LSTM provided in this embodiment of this application includes a plurality of LSTM memory units, and parameters in the LSTM memory units may be determined by using the following formulas 1 to 5.

The formulas 1 to 5 are as follows:

i_(t) = σ(W_(i)x_(t) + U_(i)h_(t−1))  (1)

f_(t) = σ(W_(f)x_(t) + U_(f)h_(t−1))  (2)

o_(t) = σ(W_(o)x_(t) + U_(o)h_(t−1))  (3)

c_(t) = f_(t)c_(t−1) + i_(t)φ(W_(c)x_(t) + U_(c)h_(t−1))  (4)

h_(t) = o_(t)φ(c_(t))  (5)

In the formulas 1 to 5, σ(x) and φ(x) are both non-linear activation functions, where σ(x) is a sigmoid function and meets σ(x) = (1 + exp(−x))⁻¹, and φ(x) is a tanh function and meets φ(x) = tanh(x).

In this embodiment of this application, the local image features corresponding to the part candidate regions are serially connected in the preset part sequence to form the feature sequence, and the feature sequence is input to the bidirectional LSTM. Therefore, the local image feature that is input at a moment t corresponds to one part candidate region, and in the formulas 1 to 5 the variable t likewise corresponds to a part candidate region.

x_(t) represents the local image feature that corresponds to a part candidate region and that is input at the moment t; i_(t), f_(t), and o_(t) respectively represent the probabilities, output at the moment t by an input gate, a memory gate, and an output gate, that the local image feature input at the moment t corresponds to a pedestrian part. The input gate, the memory gate, and the output gate are collectively referred to as the logic gates of the LSTM memory unit. c_(t) represents the information of the pedestrian part indicated by the local image feature input at the moment t; for ease of description, this information may be referred to as the information about the LSTM memory unit at the current moment t.

In the bidirectional LSTM network provided in this embodiment of this application, during the calculation of the information about the LSTM memory unit at the current moment t and of the probabilities output by the logic gates (the input gate, the output gate, and the memory gate) of the LSTM memory unit, the input x_(t) corresponding to each part candidate region at the current moment t and the implied variable h_(t−1) corresponding to each part candidate region at the previous moment t−1 are each transformed by a weight matrix: W for the input, for example, W_(i) corresponding to i_(t), W_(f) corresponding to f_(t), W_(o) corresponding to o_(t), and W_(c) corresponding to c_(t), and likewise U for the implied variable. The implied variable h_(t−1) may be determined by the output of the output gate and the memory unit at the previous moment t−1. The implied variable is an invisible status variable, and is a parameter relative to an observable variable. The observable variable may include a feature that can be directly obtained from the to-be-detected image. The implied variable is a variable at an abstract concept layer higher than the concept layer of the observable variable, and is a parameter that can be used to control a change of the observable variable.

The bidirectional LSTM provided in this embodiment of this application is a network module that constantly uses context information of the input feature sequence. Therefore, the data obtained through processing at the current moment t, at the previous moment t−1, and at the next moment t+1 may be mutually nested. For example, the output of each logic gate of the LSTM memory unit at the current moment t and the output of the memory unit at the current moment t are obtained, under the action of the weight transformation matrix W, from the local image feature x_(t) input at the moment t and the implied variable h_(t−1) obtained through processing at the previous moment t−1. Finally, the implied variable h_(t+1) at the next moment t+1 is obtained based on the output of the memory unit and the output gate at the current moment t.
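For concreteness, a direct transcription of formulas (1) to (5) is sketched below in Python with NumPy; the parameter-dictionary layout is an assumption made only for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM memory-unit update following formulas (1)-(5); P holds the
    weight matrices W_i, U_i, W_f, U_f, W_o, U_o, W_c, U_c."""
    i_t = sigmoid(P["W_i"] @ x_t + P["U_i"] @ h_prev)   # input gate, (1)
    f_t = sigmoid(P["W_f"] @ x_t + P["U_f"] @ h_prev)   # memory gate, (2)
    o_t = sigmoid(P["W_o"] @ x_t + P["U_o"] @ h_prev)   # output gate, (3)
    c_t = f_t * c_prev + i_t * np.tanh(P["W_c"] @ x_t + P["U_c"] @ h_prev)  # (4)
    h_t = o_t * np.tanh(c_t)                            # implied variable, (5)
    return h_t, c_t
```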

In this embodiment of this application, the output of the bidirectional LSTM is the part relationship feature indicating the relationship between the part candidate regions, and the part relationship feature is an output sequence corresponding to the input feature sequence. Therefore, the part relationship learning unit 114 may merge, in the sequence dimension by using a method such as linear weighting, the output results obtained when the bidirectional LSTM model learns the relationship between pedestrian parts, to obtain part relationship features of a to-be-detected pedestrian in different posture change cases and blocking cases. For example, merging the output results of the bidirectional LSTM model in the sequence dimension through linear weighting may be assigning a coefficient to the feature that is in the output sequence and that corresponds to each moment, where all the coefficients add up to 1. Each feature may then be multiplied by its coefficient, and the products are added up to obtain the part relationship feature produced by the linear-weighting merge.
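A minimal sketch of this linear-weighting merge, assuming the output sequence has already been collected into a (num_parts, feat_dim) array:

```python
import numpy as np

def merge_sequence(outputs, weights):
    """Merge the bidirectional LSTM output sequence in the sequence
    dimension by linear weighting: the coefficients sum to 1, each output
    is scaled by its coefficient, and the scaled outputs are summed into
    one part relationship feature."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "coefficients must add up to 1"
    return (weights[:, None] * outputs).sum(axis=0)
```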

In this embodiment of this application, in the process in which the part relationship learning unit 114 learns the relationship between the pedestrian parts by using the bidirectional LSTM model, the output result of the bidirectional LSTM model is the part relationship feature indicating the relationship between the parts, that is, a feature used to describe the relationship between the part candidate regions. The part relationship feature may be directly sent to a local classifier to obtain a classification result (that is, a detection result) indicating whether a pedestrian exists in a pedestrian candidate region in the to-be-detected image. The local classifier may be a network model that is pre-trained by using local image features, in sample images, corresponding to pedestrian parts and that has thereby obtained a capability of distinguishing between a pedestrian and a background. The local classifier may determine, based on an input local image feature, whether the feature is an image feature that includes a pedestrian or a background image feature that does not. In addition, to increase the degree to which the overall image feature extracted by the feature extraction unit 111 fits the pedestrian detection task, in this embodiment of this application, the part relationship feature obtained by the part relationship learning unit 114 through learning may be further merged with the overall image feature, to implement pedestrian detection in the to-be-detected image, thereby improving accuracy of pedestrian detection.

S304. Detect the to-be-detected target in the to-be-detected image based on the part relationship feature obtained by the bidirectional LSTM through learning.

In some implementations, the target prediction unit 115 detects the to-be-detected pedestrian in the to-be-detected image based on the part relationship feature obtained by the bidirectional LSTM through learning. The target prediction unit 115 may predict, based on a merged feature, whether the to-be-detected image includes the pedestrian and a location of the included pedestrian (namely, the location of the pedestrian candidate region that includes the pedestrian). Optionally, the target prediction unit 115 may merge, through multi-task learning, the overall image feature extracted by the feature extraction unit 111 with the part relationship feature obtained by the bidirectional LSTM through learning, send the merged feature to the local classifier, and learn the merged feature by using the local classifier, to obtain a first confidence level of each of a category (for example, a pedestrian) and a location of the to-be-detected target in the to-be-detected image. In this embodiment of this application, the first confidence level indicates the prediction result of the local classifier on the possibility that a pedestrian candidate region includes a pedestrian. When the first confidence level is greater than or equal to a preset threshold, it may be determined that the prediction result of the local classifier is that the pedestrian candidate region is a pedestrian region including a pedestrian; or when the first confidence level is less than the preset threshold, it may be determined that the pedestrian candidate region is a background region that does not include a pedestrian. If the pedestrian candidate region is a pedestrian region, the location of the pedestrian region in the to-be-detected image may be determined as the specific location of the pedestrian in the to-be-detected image. The overall image feature may be merged with the part relationship feature through linear weighting, serial connection, a convolution operation, or in another manner, and the merging manner may be determined according to a requirement of an actual application scenario. This is not limited herein. The merged feature makes the overall image feature and the local image feature complementary to each other, capturing both the structure relationship at the overall layer and the part relationship at the part layer in various blocking cases, so that the advantages of the overall structure relationship and the part relationship complement each other, thereby improving accuracy of pedestrian detection. For example, the overall image feature is an image feature extracted by using a pedestrian as a whole, and represents the image feature of a pedestrian candidate region that may include a pedestrian in the to-be-detected image. The part relationship feature is an image feature that corresponds to pedestrian parts and that represents the relationship between pedestrian parts in the part candidate regions, and represents the local image feature of each part of a pedestrian in a pedestrian candidate region that may include the pedestrian in the to-be-detected image. For example, the overall image feature may indicate that a pedestrian candidate region is a pedestrian region that may include a pedestrian, and may be used to predict a pedestrian posture at the whole layer, for example, an upright walking state in which parts do not block each other.
The relationship between pedestrian parts that is represented by the part relationship feature covers two prediction results: a relationship between the complete parts of one pedestrian (including all of a head, a left arm, a right arm, a left hand, a right hand, an upper body, a left leg, a right leg, a left foot, and a right foot), or a relationship between parts of two different pedestrians (the head, arms, hands, upper body, legs, and feet present belong to two pedestrians). Therefore, the overall image feature may be merged with the part relationship feature to select the image feature that indicates the pedestrian with higher precision (the prediction result of a pedestrian is reserved). The merged feature may be a feature that is used to predict, at both the overall layer and the local layer, that a local image feature included in a part candidate region corresponds to a pedestrian part. Overall determining by using the overall image feature and local determining by using the local image feature compensate for each other, to implement pedestrian detection with higher precision and higher accuracy.
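Of the merging manners mentioned above (linear weighting, serial connection, a convolution operation), a serial connection is the simplest to sketch; the feature dimensions below are illustrative assumptions only.

```python
import numpy as np

def merge_features(overall_feat, part_relation_feat):
    """Merge the overall image feature with the part relationship feature
    by serial connection (concatenation), one of the merging manners the
    text mentions. The merged vector is what the local classifier consumes."""
    return np.concatenate([overall_feat, part_relation_feat], axis=-1)

# e.g. a 512-d overall feature and a 256-d part relationship feature
merged = merge_features(np.zeros(512), np.zeros(256))  # -> 768-d merged feature
```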

In this embodiment of this application, a classifier whose input is a part relationship feature corresponding to pedestrian parts and that predicts a pedestrian and a background by using the part relationship feature is referred to as a local classifier. Correspondingly, a classifier whose input is an overall image feature and that predicts a pedestrian by using the overall image feature is referred to as an overall classifier.

Optionally, the target prediction unit 115 may further send the overall image feature of the pedestrian candidate region to the overall classifier (for example, a Softmax classifier); determine, by using the overall classifier, a second confidence level at which the pedestrian candidate region includes the to-be-detected target; determine, based on merging of the first confidence level and the second confidence level, that the to-be-detected image includes the to-be-detected target (namely, the pedestrian); and then determine the location of the pedestrian based on the location of the pedestrian candidate region. In this embodiment of this application, the first confidence level is the prediction result of the local classifier on the possibility that a pedestrian candidate region includes a pedestrian, determined at the pedestrian part layer; the second confidence level is the prediction result of the overall classifier on that possibility, determined at the overall pedestrian layer. When the second confidence level is greater than or equal to a preset threshold, it may be determined that the prediction result of the overall classifier is that the pedestrian candidate region is a pedestrian region including a pedestrian; or when the second confidence level is less than the preset threshold, it may be determined that the prediction result is that the pedestrian candidate region is a background region that does not include a pedestrian. If the pedestrian candidate region is a pedestrian region, the location of the pedestrian region in the to-be-detected image may be determined as the specific location of the pedestrian in the to-be-detected image. In this embodiment of this application, the first confidence level is merged with the second confidence level, so that a more accurate prediction result may be obtained from the prediction result corresponding to the first confidence level with reference to the prediction result corresponding to the second confidence level, thereby improving prediction precision of pedestrian detection. Furthermore, the overall image feature of the pedestrian candidate region in the to-be-detected image may be merged with the part relationship feature between the part candidate regions, that is, the overall image feature is merged with the local image features, to obtain a richer feature expression and hence a more accurate detection result.
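The confidence-level fusion can be illustrated with a small sketch; the weighted-average rule and the default values below are assumptions, since the text leaves the exact merging rule to the application scenario.

```python
def detect_pedestrian(first_conf, second_conf, threshold=0.5, alpha=0.5):
    """Fuse the local-classifier confidence (part layer) with the
    overall-classifier confidence (whole-pedestrian layer). The weighted
    average and the 0.5 defaults are illustrative assumptions."""
    fused = alpha * first_conf + (1.0 - alpha) * second_conf
    is_pedestrian = fused >= threshold  # pedestrian region vs background region
    return is_pedestrian, fused
```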

A to-be-detected image processing procedure in a pedestrian detection method provided in an embodiment of this application is described below with reference to FIG. 7. FIG. 7 is a schematic diagram of a to-be-detected image processing procedure in a pedestrian detection method according to an embodiment of this application. In this embodiment of this application, a convolutional neural network is first used to extract a deep image feature for which a pedestrian is used as a detected object, and a pedestrian candidate region and the overall image feature corresponding to the pedestrian candidate region are extracted from the to-be-detected image. Then, with reference to an image segmentation network, a local image feature used for part segmentation is extracted, and part segmentation is performed in the pedestrian candidate region to obtain a plurality of part candidate regions. Further, the local image features corresponding to the pedestrian parts may be sent to a bidirectional LSTM model, so as to learn, by using the bidirectional LSTM model, a part relationship feature used to describe the relationship between the parts. The part relationship feature may be directly used as the output of the pedestrian detection result, or may be further merged with the overall image feature of the pedestrian candidate region, to obtain the output of the local classifier shown in FIG. 7. The output of the local classifier may be directly used as the output of the pedestrian detection result, which is simple to operate. Alternatively, the output of the local classifier may be merged with the output of an overall classifier that separately uses the overall image feature of the pedestrian, to obtain the pedestrian detection result shown in FIG. 7. Merging the output of the overall classifier with the output of the local classifier avoids interference caused by a part division error, thereby improving accuracy of pedestrian detection. Therefore, the target detection method has higher applicability.

In this embodiment of this application, a part candidate region of a pedestrian may be obtained by using the image segmentation network, so that a part of the pedestrian is obtained more finely, and a posture change or a blocking status of the pedestrian in the to-be-detected image can be more flexibly captured. In addition, a relationship between parts of the pedestrian is obtained by using the bidirectional LSTM, and the part relationship feature that can describe the relationship between the parts of the pedestrian is extracted, thereby further improving the image processing capability in the case of a pedestrian posture change or a blocking status. Therefore, accuracy of identifying a pedestrian part is higher, and the target detection apparatus has higher applicability. Further, in this embodiment of this application, a multi-task learning manner is used to merge the overall image feature of the pedestrian candidate region in the to-be-detected image with the local image feature of each pedestrian part candidate region in the to-be-detected image, so as to diversify the features used to determine whether the to-be-detected image includes a pedestrian and the location of the pedestrian, so that different features restrain and promote each other, thereby increasing accuracy of pedestrian detection. In the pedestrian detection method provided in this embodiment of this application, the overall image feature of the pedestrian is merged with the part relationship feature of the pedestrian, so that the method is applicable not only to pedestrian detection in a scenario of a simple pedestrian posture change, but also to pedestrian detection in a scenario of a complex pedestrian posture change. In particular, when a pedestrian posture changes to a relatively large extent or the pedestrian is relatively seriously blocked, the pedestrian detection rate is higher, and the application scope is wider.

Embodiment 2

In the pedestrian detection method described in Embodiment 1, in one aspect, an image segmentation network is constructed to obtain a part candidate region corresponding to each pedestrian part and a local image feature corresponding to each part candidate region, and a part relationship feature between the pedestrian parts is further learned by using a bidirectional LSTM model. In another aspect, the part relationship feature learned by the bidirectional LSTM model is merged with an overall image feature of a pedestrian candidate region, to implement pedestrian detection in a to-be-detected image, thereby enhancing the image processing capability of a target detection system for a pedestrian posture change and a blocking status in a complex application scenario, and implementing optimal detection on a pedestrian in an actual video surveillance scenario.

In addition, in this embodiment of this application, a pedestrian part that may be included in the to-be-detected image may be obtained based on two types of labeling information of the pedestrian: an overall rectangular box and a visible box. Then, the obtained image feature of the pedestrian part is sent to the bidirectional LSTM model, to learn a relationship between pedestrian parts. Further, pedestrian detection in the to-be-detected image may be implemented based on the part relationship feature learned by the bidirectional LSTM and the overall pedestrian feature of the pedestrian in the to-be-detected image.

FIG. 8 is another schematic flowchart of a target detection method according to an embodiment of this application. The target detection method provided in this embodiment of this application may include the following steps.

S401. Obtain a target candidate region in a to-be-detected image.

Optionally, the feature extraction unit 111 extracts an image feature for which a pedestrian is used as a detected object, and a pedestrian candidate region is determined from the to-be-detected image by using the target candidate region extraction unit 112. For an implementation in which the feature extraction unit 111 extracts, from the to-be-detected image, the overall image feature of the pedestrian candidate region, refer to the implementation described in step S301 in Embodiment 1. Details are not described herein again.

S402. Obtain a positive sample image feature and a negative sample image feature that are used for part identification, and construct a part identification model based on the positive sample image feature and the negative sample image feature.

In some implementations, the image segmentation unit 113 constructs the part identification model by using positive sample images and negative sample images of the pedestrian, so as to identify, by using the constructed part identification model, possible pedestrian parts in the pedestrian candidate region and the local image feature corresponding to each pedestrian part. The image segmentation unit 113 may obtain, by using a candidate box template in which a pedestrian is used as a detected object, the positive sample image feature and the negative sample image feature that are used for pedestrian detection from sample images used for pedestrian part identification. The candidate box template may be a pre-constructed template used for part identification function training, applicable to training the part identification function of the part identification model used for pedestrian detection. FIG. 9 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application. It is assumed that the ideal state of a pedestrian posture is that the pedestrian is in the center of the external rectangular box of the pedestrian. The external rectangular box of the pedestrian may then be evenly divided into N grids, where N is an integer greater than 1, and the grids covered by the region in which each pedestrian part, such as a head, a trunk, a left arm, a right arm, a left leg, or a right leg, is located in this ideal pedestrian posture may be determined from the N grids.

Optionally, the image segmentation unit 113 may obtain, from a data set used for pedestrian detection training, sample images used for part identification. Taking any sample image as an example, the image segmentation unit 113 may determine, from the sample image, a plurality of candidate regions in which a pedestrian is used as a detected object. FIG. 10 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application. It is assumed that FIG. 10 shows a sample image including a pedestrian 3. The image segmentation unit 113 may determine, from the sample image, four candidate regions in which the pedestrian is used as a detected object, for example, a candidate region 1 to a candidate region 4. Further, the image segmentation unit 113 may determine a candidate region labeled with the pedestrian among the plurality of candidate regions as a positive sample region. For example, a candidate box 2, namely, the external rectangular box of the pedestrian 3, exactly boxes the entire contour of the pedestrian 3, and therefore the candidate box 2 may be labeled in advance as a positive sample region used to identify the pedestrian 3. A candidate region whose intersection-over-union with the positive sample region is less than a preset proportion is determined as a negative sample region. Optionally, the intersection-over-union of two regions may be understood as the ratio of the area of the intersection of the two regions to the area of their union, and the preset proportion may be an intersection-over-union of 0.5. For example, as shown in FIG. 10, the intersection-over-union of a candidate box 1 and the candidate box 2 is obviously less than 0.5, and therefore the candidate region 1 may be determined as a negative sample region; the intersection-over-union of a candidate box 3 and the candidate box 2 is obviously greater than 0.5, and therefore the candidate region 3 may be determined as a positive sample region. Further, the image segmentation unit 113 may divide each positive sample region into N grids, and determine, from the N grids of the positive sample region by using the candidate box template, a positive sample grid and a negative sample grid that correspond to each part. For example, the image segmentation unit 113 may use, as a positive sample region of the pedestrian, the actually labeled pedestrian box (namely, the actually labeled external rectangular box of the pedestrian) among the plurality of candidate regions, and determine, as negative sample regions of the pedestrian, the local image regions corresponding to all candidate boxes whose intersection-over-union with the actually labeled pedestrian box is less than 0.5.
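The intersection-over-union computation used for this positive/negative split is standard; a short sketch follows, with the (x1, y1, x2, y2) box convention chosen only for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2):
    the ratio of the overlap area to the area of the union. A candidate
    region with IoU < 0.5 against the labeled pedestrian box is treated
    as a negative sample region."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)
```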

For any negative sample region (used as the rectangular box region corresponding to a negative sample candidate box) among all negative sample regions, the negative sample region is divided into N grids in the dividing manner of the candidate box template. A grid that corresponds to a pedestrian part and that is obtained by dividing any negative sample region is determined as a negative sample grid of that pedestrian part. For example, among the N grids obtained by dividing a negative sample region, the grid corresponding to a head is determined as a negative sample grid of the head, the grid corresponding to a trunk is determined as a negative sample grid of the trunk, and the like. For any of the positive sample regions of the pedestrian, the positive sample region is divided into N grids in the dividing manner of the candidate box template, and then the part grids covered by each part are determined from the N grids of the positive sample region based on the grids that are in the candidate box template and that are covered by the region in which the part is located. In specific implementation, a positive sample region is actually labeled as a pedestrian box, but the posture of the pedestrian in the positive sample region cannot be determined, and the real location (different from the location in the ideal state) of each part of the pedestrian is also unknown. In the candidate box template, the specific location of each part of the pedestrian in the ideal case is known. Therefore, the part grids covered by each part may be selected from the positive sample region by using the candidate box template, so as to determine the real location of each part of the pedestrian in the positive sample region. For example, FIG. 11 is another schematic diagram of a pedestrian candidate region according to an embodiment of this application. The candidate box 2 is used as an example. After the candidate box 2, used as a positive sample region, is divided into N grids in the dividing manner of the candidate box template, the part grids covered by each pedestrian part may be determined from the candidate box 2 based on the ideal location of the pedestrian part in the candidate box template. For example, the part grids covered by a pedestrian head are determined from the N grids of the candidate box 2 based on the ideal location of the pedestrian head in the candidate box template, for example, the six grids covered by the head in FIG. 11. Specifically, the six grids are numbered a grid 1, a grid 2 (not shown), a grid 3, a grid 4 (not shown), a grid 5, and a grid 6 (not shown) in a left-to-right, top-to-bottom order. This is not limited herein.

Further, a visible region that is of the pedestrian in a real pedestrian posture (different from the pedestrian posture in the ideal state) and that is included in the sample image may be determined from the positive sample region, for example, the visible region of the pedestrian 3 in the candidate box 2. The visible regions of the pedestrian 3 in the candidate box 2 include the regions covered by parts such as a head, a trunk, a left arm, a left hand, a right arm, a right hand, a left leg, a right leg, and a right foot. For ease of description, the visible region of a part may be referred to as the part-visible region of the part. In the positive sample region, the part-visible region of any part is the region covered by that part in the positive sample region. The visible region of any part may include one or more of the N grids, that is, any part may cover one or more grids in the positive sample region, and each such grid may include a part of the part-visible region of the part. For example, as shown in FIG. 11, in the actually labeled candidate box 2, the head of the pedestrian 3 covers six grids: the grid 1 to the grid 6. In other words, the head-visible region of the pedestrian 3 in the candidate box 2 includes the grid 1 to the grid 6, and each of the grid 1 to the grid 6 includes a part of the head-visible region of the pedestrian 3. The image segmentation unit 113 may determine, based on the area of the part-visible region of a part in a grid, whether to determine the grid as a positive sample grid or a negative sample grid of the part. When the part grids covered by any part i include a part grid j, and the degree at which the region covered by the part i in the part grid j overlaps the region of the part grid j is greater than or equal to a preset threshold, the part grid j is determined as a positive sample grid of the part i. The degree at which the visible region, of the part i, included in the part grid j overlaps the region of the part grid j is the ratio of the area of the visible region, of the part i, included in the part grid j to the area of the part grid j, where both i and j are natural numbers. For example, referring to FIG. 11, in the head-visible region shown in the candidate box 2, the degrees at which the head-visible region overlaps the grid 1 to the grid 6 are compared. If the degree at which the head-visible region in the grid 1 overlaps the grid 1 is less than the preset threshold, the grid 1 may be determined as a negative sample grid of the head. If the degrees at which the visible regions in the grid 3 and the grid 5 overlap those grids are greater than the preset threshold, the grid 3 and the grid 5 may be determined as positive sample grids of the head. Likewise, a positive sample grid of each part other than the head may be determined from each positive sample region. When the degree at which the visible region of any part i′ in any part grid j′ covered by the part i′ overlaps the region of the part grid j′ is less than the preset threshold, the part grid j′ is determined as a negative sample grid of the part i′. Likewise, a negative sample grid of each part other than the head may be determined from each negative sample region. For example, in the rectangular box region corresponding to any positive sample region, the image segmentation unit 113 may calculate the degree at which the part-visible region in each grid region of the rectangular box region overlaps that grid region, and a threshold is preset for the overlapping degree, to determine whether to add, to the positive sample grids of a pedestrian part, a grid covered by the visible region of that pedestrian part.
If the degree at which the visible region of a pedestrian part included in a grid overlaps the region of the grid is greater than or equal to the preset threshold, the grid is determined as a positive sample grid of the pedestrian part corresponding to the grid. To be specific, if the ratio of the area of the visible region of a pedestrian part included in the grid to the area of the grid is greater than or equal to the preset threshold, the grid is determined as a positive sample grid of the pedestrian part. If the degree at which the visible region of the pedestrian part included in the grid overlaps the region of the grid is less than the preset threshold, the grid is determined as a negative sample grid of the pedestrian part corresponding to the grid, so as to obtain a positive sample grid and a negative sample grid that correspond to each part of the pedestrian.
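A compact sketch of this grid-labeling rule follows; the even grid split and the boolean-mask representation of the part-visible region are assumptions made for illustration.

```python
import numpy as np

def label_grids(visible_mask, n_rows, n_cols, overlap_threshold):
    """Split one part's visibility mask over a positive sample region into
    an n_rows x n_cols grid and label each cell: a cell is a positive
    sample grid if the visible-region area inside it divided by the cell
    area meets the threshold, otherwise a negative sample grid."""
    H, W = visible_mask.shape
    labels = np.zeros((n_rows, n_cols), dtype=bool)
    for r in range(n_rows):
        for c in range(n_cols):
            cell = visible_mask[r * H // n_rows:(r + 1) * H // n_rows,
                                c * W // n_cols:(c + 1) * W // n_cols]
            ratio = cell.mean()  # visible area of the part / grid cell area
            labels[r, c] = ratio >= overlap_threshold
    return labels  # True: positive sample grid, False: negative sample grid
```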

In this embodiment of this application, after obtaining the positive sample grids corresponding to each part, the image segmentation unit 113 may determine the image feature of each positive sample grid as a positive sample image feature of the part, and determine the image feature of each negative sample grid region of the part as a negative sample image feature of the part. The image segmentation unit 113 may determine, by using a massive number of sample images, the positive sample image feature and the negative sample image feature that correspond to each part of the pedestrian, thereby improving extraction accuracy of the image feature corresponding to the part and improving accuracy of part segmentation of the pedestrian. In Embodiment 2, different from Embodiment 1, the image segmentation unit 113 does not extract the local image feature corresponding to each part by using an image segmentation network, but obtains, by using massive sample images, the positive sample image feature and the negative sample image feature that correspond to each part, thereby adding another manner of extracting the local image feature corresponding to each part.

In some implementations, after obtaining the positive sample image feature and the negative sample image feature of each part, the image segmentation unit 113 may use them as input of a part identification model, and learn, by using the part identification model and by using, as a learning objective, a binary classification problem distinguishing between a pedestrian part and a background, a capability of obtaining the local image feature of a pedestrian part. For example, a VGG Net obtained by training on the ImageNet data set may first be selected as the original network model used for training, and the 1000-class classification problem of the original ImageNet data set is then replaced with a binary classification problem distinguishing between a pedestrian part and a non-pedestrian part. The VGG Net is trained with reference to the positive sample image feature and the negative sample image feature that are used for part identification: the existing network model framework of the VGG Net is kept, and the positive and negative sample image features are used to train, for this framework, a function of distinguishing between a pedestrian part and a non-pedestrian part. The network parameter of the VGG Net is adjusted through this training, so that the network parameter becomes applicable to pedestrian part identification. This process may be referred to as constructing a part identification model used for part identification.
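As a sketch of that fine-tuning setup, the snippet below uses torchvision's VGG-16 as a stand-in for the VGG Net mentioned above; the exact VGG variant and the training loop are not specified in the text and are assumptions here.

```python
import torch.nn as nn
from torchvision import models

# Start from a VGG network trained on ImageNet, then replace its
# 1000-class head with a binary head that distinguishes pedestrian parts
# from background. Training on the positive/negative part samples then
# adjusts the network parameters for part identification.
vgg = models.vgg16(pretrained=True)     # ImageNet-trained VGG Net
vgg.classifier[6] = nn.Linear(4096, 2)  # 1000-class -> binary problem
# ... train `vgg` on the part positive/negative samples (e.g. cross-entropy)
```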

S403. Determine, by using the part identification model from the target candidate region, part candidate regions respectively corresponding to at least two parts, where each part candidate region corresponds to one part of a to-be-detected target; and extract, from the to-be-detected image, local image features corresponding to the part candidate regions.

In some implementations, after obtaining, through training, the part identification model that has the capability of obtaining the local image feature of a pedestrian part, the image segmentation unit 113 may identify one or more part candidate regions from the pedestrian candidate region in the to-be-detected image by using the part identification model, and further extract, from the to-be-detected image, the local image features corresponding to the part candidate regions. A relationship between pedestrian parts may then be determined based on the local image features corresponding to the pedestrian parts.

S404. Learn the local image features of the part candidate regions by using a bidirectional LSTM, to obtain a part relationship feature used to describe a relationship between the part candidate regions.

S405. Detect the to-be-detected target in the to-be-detected image based on the part relationship feature obtained by the bidirectional LSTM through learning.

Optionally, for an implementation in which the relationship between the part candidate regions is learned by using the bidirectional LSTM, and the pedestrian in the to-be-detected image is detected based on the part relationship feature obtained by the bidirectional LSTM through learning and with reference to the overall image feature of the pedestrian in the to-be-detected image, refer to the implementations described in steps S303 and S304 in Embodiment 1. Details are not described herein again.

In this embodiment of this application, in the earlier data preparation phase of the implementation provided in Embodiment 2, the parts of the pedestrian do not need to be separately labeled, and neither a pixel-level label nor a rectangular box label is required. Therefore, the workload caused by data obtaining in the earlier training phase can be reduced, greatly reducing the time consumed for earlier data preparation; this is simpler to operate, and the implementation complexity of pedestrian detection is reduced. An implementation different from that of Embodiment 1 is used in Embodiment 2 to identify a pedestrian part, so as to diversify the manners of identifying the pedestrian part and the implementations of pedestrian detection in the to-be-detected image.

In this embodiment of this application, the target detection apparatus 11 shown in FIG. 2 may be configured to perform the target detection method provided in Embodiment 1 and/or Embodiment 2 by using the units (or modules) included in the target detection apparatus 11. To better distinguish between the different operations performed by the built-in units of the target detection apparatus 11 when it performs different embodiments (for example, Embodiment 3 and Embodiment 4), a target detection apparatus 21 and a target detection apparatus 31 are used as examples below. The target detection apparatus 21 may be an apparatus configured to perform the target detection method provided in Embodiment 1, and the target detection apparatus 31 may be an apparatus configured to perform the target detection method provided in Embodiment 2. Implementations performed by the target detection apparatuses provided in the embodiments of this application are described below with reference to FIG. 12 and FIG. 13.

Embodiment 3

FIG. 12 is another schematic structural diagram of a target detection apparatus according to an embodiment of this application. In this embodiment of this application, a target detection apparatus 21 may include a target candidate region extraction unit 211, an image segmentation unit 213, a part relationship learning unit 214, and a target prediction unit 215.

The target candidate region extraction unit 211 is configured to obtain a target candidate region, in a to-be-detected image, for which a target is used as a detected object.

The image segmentation unit 213 is configured to: determine, by using an image segmentation network, at least two part candidate regions from the target candidate region extracted by the target candidate region extraction unit 211, where each part candidate region corresponds to one part of a to-be-detected target; and extract, from the to-be-detected image, local image features corresponding to the part candidate regions.

The part relationship learning unit 214 is configured to learn, by using a bidirectional LSTM, the local image features of the part candidate regions extracted by the image segmentation unit 213, to obtain a part relationship feature used to describe a relationship between the part candidate regions.

The target prediction unit 215 is configured to detect the to-be-detected target in the to-be-detected image based on the part relationship feature obtained by the part relationship learning unit 214.

Optionally, the target detection apparatus may further include a feature extraction unit 212.

In some implementations, the feature extraction unit 212 is configured to obtain a global image feature that is in the to-be-detected image and that corresponds to the target candidate region extracted by the target candidate region extraction unit 211.

The target prediction unit 215 is configured to determine the to-be-detected target in the to-be-detected image based on the part relationship feature obtained by the part relationship learning unit 214 and with reference to the global image feature obtained by the feature extraction unit 212.

In some implementations, the target prediction unit 215 is configured to:

merge the part relationship feature obtained by the part relationship learning unit 214 with the global image feature obtained by the feature extraction unit 212; obtain, through learning, a first confidence level of each of a category and a location of the to-be-detected target in the to-be-detected image based on the merged feature; determine, based on the global image feature, a second confidence level at which the target candidate region obtained by the target candidate region extraction unit 211 includes the to-be-detected target; determine, based on merging of the first confidence level and the second confidence level, that the to-be-detected image includes the to-be-detected target; and determine a location of the to-be-detected target in the to-be-detected image based on a location, in the to-be-detected image, of the target candidate region including the to-be-detected target.

In some implementations, the part relationship learning unit 214 is configured to: sort, in a preset part sequence, the local image features of the part candidate regions extracted by the image segmentation unit 213 to obtain a sorted feature sequence, and input the sorted feature sequence to the bidirectional LSTM; and learn, by using the bidirectional LSTM, the relationship between the part candidate regions by using, as a learning task, a binary classification problem distinguishing between a target and a background.

In some implementations, the relationship between the part candidate regions includes at least one of a relationship between the detected target and the part candidate regions, or a dependency relationship between the part candidate regions. The relationship between the detected target and the part candidate regions includes: a relationship that is between a same detected target to which the part candidate regions belong and the part candidate regions and that exists when the part candidate regions belong to the same detected target, and/or a relationship that is between each of the part candidate regions and the detected target to which the part candidate region belongs and that exists when the part candidate regions belong to different detected targets.

In specific implementation, the target detection apparatus 21 may perform the implementations provided in the steps in Embodiment 1 by using the built-in units of the target detection apparatus 21. For details, refer to the implementation performed by a corresponding unit in Embodiment 1. Details are not described herein again.

In this embodiment of this application, a part candidate region of a pedestrian may be obtained by using the image segmentation network, so that a part of the pedestrian is obtained more finely, and a posture change or a blocking status of the pedestrian in the to-be-detected image can be captured more flexibly. In addition, in this embodiment of this application, a relationship between parts of the pedestrian is obtained by using the bidirectional LSTM, and the part relationship feature that can be used to describe the relationship between the parts of the pedestrian is extracted, thereby further improving an image processing capability in a case of a pedestrian posture change or a blocking status. Therefore, accuracy of identifying a pedestrian part is higher, and the target detection apparatus has higher applicability. Further, in this embodiment of this application, a multi-task learning manner is used to merge an overall image feature of a pedestrian candidate region in the to-be-detected image with a local image feature of each pedestrian part candidate region in the to-be-detected image, so as to diversify the features used to determine whether the to-be-detected image includes a pedestrian and, if so, the location of the pedestrian in the to-be-detected image, so that different features restrain and promote each other, thereby increasing accuracy of pedestrian detection. In the pedestrian detection method provided in this embodiment of this application, the overall image feature of the pedestrian is merged with the part relationship feature of the pedestrian, so that the method is applicable not only to pedestrian detection in a scenario of a simple pedestrian posture change, but also to pedestrian detection in a scenario of a complex pedestrian posture change. In particular, when a pedestrian posture changes to a relatively large extent or the pedestrian is relatively seriously blocked, the pedestrian detection rate is higher, and the application scope is wider.

Embodiment 4

FIG. 13 is another schematic structural diagram of a target detection apparatus according to an embodiment of this application. In this embodiment of this application, a target detection apparatus 31 may include a target candidate region extraction unit 311, an image segmentation unit 313, a part relationship learning unit 314, and a target prediction unit 315.

The target candidate region extraction unit 311 is configured to obtain a target candidate region, in a to-be-detected image, for which a target is used as a detected object.

The image segmentation unit 313 is configured to: obtain a positive sample image feature and a negative sample image feature that are used for part identification, and construct a part identification model based on the positive sample image feature and the negative sample image feature.

Optionally, the target detection apparatus 31 may further include a feature extraction unit 312.

The feature extraction unit 312 is configured to obtain a global image feature that is in the to-be-detected image and that corresponds to the target candidate region extracted by the target candidate region extraction unit 311.

The image segmentation unit 313 is further configured to: identify, by using the part identification model, at least two part candidate regions from the target candidate region extracted by the target candidate region extraction unit 311, where each part candidate region corresponds to one part of a to-be-detected target; and extract, from the to-be-detected image, local image features corresponding to the part candidate regions.

The part relationship learning unit 314 is configured to learn, by using a bidirectional LSTM, the local image features of the part candidate regions extracted by the image segmentation unit 313, to obtain a part relationship feature used to describe a relationship between the part candidate regions.

The target prediction unit 315 is configured to detect the to-be-detected target in the to-be-detected image based on the part relationship feature obtained by the part relationship learning unit 314.

In some implementations, the image segmentation unit 313 is configured to:

obtain a candidate box template in which a target is used as a detected object, divide the candidate box template into N grids, and determine, from the N grids, a grid covered by a region in which each part of the target is located, where N is an integer greater than 1; obtain a sample image used for part identification, and determine, from the sample image, a plurality of candidate regions in which the target is used as a detected object; determine a candidate region labeled with the target in the plurality of candidate regions as a positive sample region of the target, and determine a candidate region whose intersection-over-union with the positive sample region is less than a preset proportion as a negative sample region of the target; divide the positive sample region into N grids, and determine, from the N grids of the positive sample region based on the candidate box template, a positive sample grid and a negative sample grid that correspond to each part; divide the negative sample region into N grids, and determine a grid that is in the N grids of the negative sample region and that corresponds to each part as a negative sample grid of the part; and determine an image feature of a positive sample grid region of each part as a positive sample image feature of the part, and determine an image feature of a negative sample grid region of each part as a negative sample image feature of the part.
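A plain-Python sketch of this sampling procedure is given below; the square n-by-n grid layout, the equality test used to find the labeled candidate, and the 0.3 negative intersection-over-union proportion are assumptions made for illustration.

    def iou(a, b):
        # Intersection-over-union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def split_into_grids(box, n):
        # Divide a (x1, y1, x2, y2) box into N = n * n equal grids.
        x1, y1, x2, y2 = box
        gw, gh = (x2 - x1) / n, (y2 - y1) / n
        return [(x1 + c * gw, y1 + r * gh, x1 + (c + 1) * gw, y1 + (r + 1) * gh)
                for r in range(n) for c in range(n)]

    def split_sample_regions(candidates, labeled_box, neg_iou_max=0.3):
        # Positive sample region: the candidate labeled with the target;
        # negative sample regions: candidates whose IoU with the positive
        # region is below the preset proportion.
        positives = [c for c in candidates if c == labeled_box]
        negatives = [c for c in candidates if iou(c, labeled_box) < neg_iou_max]
        return positives, negatives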

In some implementations, the image segmentation unit 313 is configured to:

determine, from the N grids of the positive sample region based on a grid that is in the candidate box template and that is covered by a region in which each part is located, a part grid covered by each part; and when the part grids covered by any part i include a part grid j, and a degree at which a region covered by the part i in the part grid j overlaps a region of the part grid j is greater than or equal to a preset threshold, determine the part grid j as a positive sample grid of the part i, to determine a positive sample grid of each part, where both i and j are natural numbers; or when the part grids covered by any part i include a part grid j, and the degree at which the region covered by the part i in the part grid j overlaps the region of the part grid j is less than the preset threshold, determine the part grid j as a negative sample grid of the part i, to determine a negative sample grid of each part.
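The grid-labeling rule just described can be sketched as follows, building on split_into_grids above; the 0.5 overlap threshold is an assumed value, not one fixed by this application.

    def overlap_fraction(part_box, grid):
        # Fraction of the grid's area covered by the region of the part.
        ix1, iy1 = max(part_box[0], grid[0]), max(part_box[1], grid[1])
        ix2, iy2 = min(part_box[2], grid[2]), min(part_box[3], grid[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        grid_area = (grid[2] - grid[0]) * (grid[3] - grid[1])
        return inter / grid_area

    def label_part_grids(part_box, grids, thresh=0.5):
        # Grid j is a positive sample grid of part i when the part covers at
        # least `thresh` of the grid's area, and a negative sample grid otherwise.
        pos, neg = [], []
        for j, grid in enumerate(grids):
            if overlap_fraction(part_box, grid) >= thresh:
                pos.append(j)
            else:
                neg.append(j)
        return pos, neg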

In some implementations, the image segmentation unit 313 is configured to:

use the positive sample image feature of each part and the negative sample image feature of each part as input of the part identification model, and learn, by using the part identification model and by using a binary classification problem distinguishing between a target part and a background as a learning task, a capability of obtaining a local image feature of the part.
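As a hedged sketch, each part's identification model could be trained as a small binary classifier over the grid image features; the feature dimension, network shape, and optimizer settings below are illustrative assumptions.

    import torch
    import torch.nn as nn

    def train_part_classifier(pos_feats, neg_feats, feat_dim=512, epochs=10):
        # pos_feats / neg_feats: (M, feat_dim) grid image features of one part.
        model = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1))  # part vs. background logit
        loss_fn = nn.BCEWithLogitsLoss()
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        x = torch.cat([pos_feats, neg_feats])
        y = torch.cat([torch.ones(len(pos_feats)), torch.zeros(len(neg_feats))])
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(-1), y)
            loss.backward()
            opt.step()
        return model  # one such classifier is trained per part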

In some implementations, the target prediction unit 315 is configured to:

merge the part relationship feature obtained by the part relationship learning unit 314 with the global image feature obtained by the feature extraction unit 312; obtain, through learning, a first confidence level of each of a category and a location of the to-be-detected target in the to-be-detected image based on a merged feature; determine, based on the global image feature, a second confidence level at which the target candidate region includes the to-be-detected target, and determine, based on merging of the first confidence level and the second confidence level, that the to-be-detected image includes the to-be-detected target; and determine a location of the to-be-detected target in the to-be-detected image based on a location, in the to-be-detected image, of the target candidate region including the to-be-detected target.

In some implementations, the part relationship learning unit 314 is configured to:

sort, in a preset part sequence, the local image features of the part candidate regions obtained by the image segmentation unit 313 to obtain a sorted feature sequence, and input the sorted feature sequence to the bidirectional LSTM; and learn, by using the bidirectional LSTM, the relationship between the part candidate regions by using a binary classification problem distinguishing between a target and a background as a learning task.

In some implementations, the relationship between the part candidate regions includes at least one of a relationship between the detected target and the part candidate regions, or a dependency relationship between the part candidate regions. The relationship between the detected target and the part candidate regions includes: when the part candidate regions belong to a same detected target, a relationship between the part candidate regions and that detected target; and/or when the part candidate regions belong to different detected targets, a relationship between each part candidate region and the detected target to which the part candidate region belongs.

In specific implementation, the target detection apparatus 31 may perform the implementations provided in the steps in Embodiment 2 by using the built-in units of the target detection apparatus 31. For details, refer to the implementation performed by a corresponding unit in Embodiment 2. Details are not described herein again.

In this embodiment of this application, in an earlier data preparation phase, the target detection apparatus 31 provided in Embodiment 4 does not need to separately label parts of a pedestrian, and neither a pixel-level label nor a rectangular box label is required. Therefore, the workload caused by data obtaining in an earlier training phase can be reduced, thereby greatly reducing the time consumed for earlier data preparation. The operation is simpler, and the implementation complexity of pedestrian detection is reduced. An implementation different from that used by the target detection apparatus 21 in Embodiment 3 is used in Embodiment 4 to identify a pedestrian part, so as to diversify manners of identifying the pedestrian part and diversify implementations of pedestrian detection in the to-be-detected image.

A person of ordinary skill in the art may understand that all or some of the processes of the methods in the embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer readable storage medium. When the program runs, the processes of the methods in the embodiments are performed. The foregoing storage medium includes any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

What is claimed is:
 1. A method, comprising: obtaining a target candidate region in a to-be-detected image; determining at least two part candidate regions from the target candidate region by using an image segmentation network, wherein each part candidate region corresponds to one part of a to-be-detected target; and extracting, from the to-be-detected image, local image features corresponding to the part candidate regions; learning the local image features of the part candidate regions by using a bidirectional long short-term memory (LSTM) network, to obtain a part relationship feature used to describe a relationship between the part candidate regions; and detecting the to-be-detected target in the to-be-detected image based on the part relationship feature.
 2. The method according to claim 1, wherein the detecting the to-be-detected target in the to-be-detected image based on the part relationship feature comprises: determining the to-be-detected target in the to-be-detected image based on the part relationship feature with reference to a global image feature, wherein the global image feature corresponds to the target candidate region; and the detecting the to-be-detected target in the to-be-detected image based on the part relationship feature further comprises: obtaining the global image feature corresponding to the target candidate region.
 3. The method according to claim 2, wherein the detecting the to-be-detected target in the to-be-detected image based on the part relationship feature with reference to a global image feature comprises: merging the part relationship feature with the global image feature, and obtaining, through learning, a first confidence level of each of a category and a location of the to-be-detected target in the to-be-detected image based on a merged feature; determining, based on the global image feature, a second confidence level at which the target candidate region comprises the to-be-detected target; and determining, based on merging of the first confidence level and the second confidence level, that the to-be-detected image comprises the to-be-detected target; and determining a location of the to-be-detected target in the to-be-detected image based on a location of the target candidate region in the to-be-detected image.
 4. The method according to claim 1, wherein the learning the local image features of the part candidate regions by using the LSTM network comprises: sorting the local image features of the part candidate regions in a preset sequence to obtain a sorted feature sequence, and inputting the feature sequence to the LSTM network; and learning, by using the LSTM network, the relationship between the part candidate regions by using a binary classification problem distinguishing between a target and a background as a learning task.
 5. The method according to claim 4, wherein the relationship between the part candidate regions comprises at least one of a relationship between the to-be-detected target and the part candidate regions, or a dependency relationship between the part candidate regions.
 6. A method, comprising: obtaining a target candidate region in a to-be-detected image; obtaining a positive sample image feature and a negative sample image feature that are used for part identification, and constructing a part identification model based on the positive sample image feature and the negative sample image feature; identifying at least two part candidate regions from the target candidate region by using the part identification model, wherein each part candidate region corresponds to one part of a to-be-detected target; and extracting, from the to-be-detected image, local image features corresponding to the part candidate regions; learning the local image features of the part candidate regions by using a bidirectional long short-term memory (LSTM) network, to obtain a part relationship feature used to describe a relationship between the part candidate regions; and detecting the to-be-detected target in the to-be-detected image based on the part relationship feature.
 7. The method according to claim 6, wherein the obtaining the positive sample image feature and the negative sample image feature that are used for part identification comprises: obtaining a candidate box template, dividing the candidate box template into N grids, and determining, from the N grids, a grid covered by a region in which each part of the target is located, wherein N is an integer greater than 1; obtaining a sample image used for part identification, and determining a plurality of candidate regions from the sample image; determining a candidate region labeled with the target in the plurality of candidate regions as a positive sample region of the target, and determining a candidate region whose intersection-over-union with the positive sample region is less than a preset proportion as a negative sample region of the target; dividing the positive sample region into N grids, and determining, from the N grids of the positive sample region based on the candidate box template, a positive sample grid and a negative sample grid that correspond to each part; dividing the negative sample region into N grids, and determining a grid that is in the N grids of the negative sample region and that corresponds to a respective part as a negative sample grid of the part; and determining an image feature of a positive sample grid region of each part as a positive sample image feature of the part, and determining an image feature of a negative sample grid region of each part as a negative sample image feature of the part.
 8. The method according to claim 7, wherein the determining, from the N grids of the positive sample region based on the candidate box template, the positive sample grid and the negative sample grid that correspond to each part comprises: determining, from the N grids of the positive sample region based on a grid that is in the candidate box template and that is covered by a region in which each part is located, one or more part grids covered by the part; and when the one or more part grids covered by any part i comprise a part grid j, and a degree at which a region covered by the part i in the part grid j overlaps a region of the part grid j is greater than or equal to a preset threshold, determining the part grid j as a positive sample grid of the part i, to determine a positive sample grid of each part, wherein both i and j are natural numbers.
 9. The method according to claim 6, wherein the constructing the part identification model based on the positive sample image feature and the negative sample image feature comprises: using the positive sample image feature of each part and the negative sample image feature of each part as input of the part identification model, and learning, by using the part identification model and by using a binary classification problem distinguishing between a target part and a background as a learning task, a capability of obtaining a local image feature of the part.
 10. The method according to claim 9, wherein the detecting the to-be-detected target in the to-be-detected image based on the part relationship feature comprises: merging the part relationship feature with a global image feature, and obtaining, through learning, a first confidence level of each of a category and a location of the to-be-detected target in the to-be-detected image based on a merged feature, wherein the global image feature corresponds to the target candidate region; determining, based on the global image feature, a second confidence level at which the target candidate region comprises the to-be-detected target; and determining, based on merging of the first confidence level and the second confidence level, that the to-be-detected image comprises the to-be-detected target; and determining a location of the to-be-detected target in the to-be-detected image based on a location, in the to-be-detected image, of the target candidate region comprising the to-be-detected target; and the detecting the to-be-detected target in the to-be-detected image based on the part relationship feature further comprises: obtaining the global image feature corresponding to the target candidate region.
 11. The method according to claim 10, wherein the learning the local image features of the part candidate regions by using the LSTM network comprises: sorting the local image features of the part candidate regions in a preset sequence to obtain a sorted feature sequence, and inputting the feature sequence to the LSTM network; and learning, by using the bidirectional long short-term memory (LSTM) network, the relationship between the part candidate regions by using a binary classification problem distinguishing between a target and a background as a learning task.
 12. The method according to claim 11, wherein the relationship between the part candidate regions comprises at least one of a relationship between the to-be-detected target and the part candidate regions, or a dependency relationship between the part candidate regions.
 13. The method according to claim 7, wherein the determining, from the N grids of the positive sample region based on the candidate box template, the positive sample grid and the negative sample grid that correspond to each part comprises: determining, from the N grids of the positive sample region based on a grid that is in the candidate box template and that is covered by a region in which each part is located, one or more part grids covered by the part; and when the one or more part grids covered by any part i comprise a part grid j, and a degree at which a region covered by the part i in the part grid j overlaps a region of the part grid j is less than a preset threshold, determining the part grid j as a negative sample grid of the part i, to determine a negative sample grid of each part, wherein both i and j are natural numbers.
 14. A computer device, wherein the computer device comprises: a processor; and a memory, wherein the memory is configured to store a program instruction, and when the processor invokes the program instruction, the program instruction enables the processor to perform a method according to the following steps: obtaining a target candidate region in a to-be-detected image; determining at least two part candidate regions from the target candidate region by using an image segmentation network, wherein each part candidate region corresponds to one part of a to-be-detected target; and extracting, from the to-be-detected image, local image features corresponding to the part candidate regions; learning the local image features of the part candidate regions by using a bidirectional long short-term memory (LSTM) network, to obtain a part relationship feature used to describe a relationship between the part candidate regions; and detecting the to-be-detected target in the to-be-detected image based on the part relationship feature.